Preface



In the winter of 2010, I taught a topics graduate course on random 
matrix theory, the lecture notes of which then formed the basis for 
this text. This course was inspired by recent developments in the 
subject, particularly with regard to the rigorous demonstration of 
universal laws for eigenvalue spacing distributions of Wigner matri- 
ces (see the recent survey [Gu2009b]). This course does not directly 
discuss these laws, but instead focuses on more foundational topics 
in random matrix theory upon which the most recent work has been 
based. For instance, the first part of the course is devoted to basic 
probabilistic tools such as concentration of measure and the central 
limit theorem, which are then used to establish basic results in ran- 
dom matrix theory, such as the Wigner semicircle law on the bulk 
distribution of eigenvalues of a Wigner random matrix, or the cir- 
cular law on the distribution of eigenvalues of an iid matrix. Other 
fundamental methods, such as free probability, the theory of deter- 
minantal processes, and the method of resolvents, are also covered in 
the course. 

This text begins in Chapter 1 with a review of the aspects of prob- 
ability theory and linear algebra needed for the topics of discussion, 
but assumes some existing familiarity with both topics, as well as a 
first-year graduate-level understanding of measure theory (as covered 
for instance in my books [Ta2011, Ta2010]). If this text is used
to give a graduate course, then Chapter 1 can largely be assigned as
reading material (or reviewed as necessary), with the lectures then 
beginning with Section 2.1. 

The core of the book is Chapter 2. While the focus of this chapter 
is ostensibly on random matrices, the first two sections of this chap- 
ter focus more on random scalar variables, in particular discussing 
extensively the concentration of measure phenomenon and the cen- 
tral limit theorem in this setting. These facts will be used repeatedly 
when we then turn our attention to random matrices, and also many 
of the proof techniques used in the scalar setting (such as the moment 
method) can be adapted to the matrix context. Several of the key 
results in this chapter are developed through the exercises, and the 
book is designed for a student who is willing to work through these 
exercises as an integral part of understanding the topics covered here. 

The material in Chapter 3 is related to the main topics of this 
text, but is optional reading (although the material on Dyson Brow- 
nian motion from Section 3.1 is referenced several times in the main 
text).

1. Preparatory material



1.1. A review of probability theory 

Random matrix theory is the study of matrices whose entries are random
variables (or equivalently, the study of random variables which
take values in spaces of matrices). As such, probability theory is an
obvious prerequisite for this subject. We will therefore begin by quickly
reviewing some basic aspects of probability theory that we will need
in the sequel.

We will certainly not attempt to cover all aspects of probability 
theory in this review. Aside from the utter foundations, we will be 
focusing primarily on those probabilistic concepts and operations that 
are useful for bounding the distribution of random variables, and on 
ensuring convergence of such variables as one sends a parameter n off 
to infinity. 

We will assume familiarity with the foundations of measure theory,
which can be found in any textbook (including my own text
[Ta2011]). This is also not intended to be a first introduction to
probability theory, but is instead a revisiting of these topics from a 
graduate-level perspective (and in particular, after one has under- 
stood the foundations of measure theory). Indeed, it will be almost 
impossible to follow this text without already having a firm grasp of 
undergraduate probability theory. 

1.1.1. Foundations. At a purely formal level, one could call prob- 
ability theory the study of measure spaces with total measure one, 
but that would be like calling number theory the study of strings 
of digits which terminate. At a practical level, the opposite is true: 
just as number theorists study concepts (e.g. primality) that have 
the same meaning in every numeral system that models the natural 
numbers, we shall see that probability theorists study concepts (e.g. 
independence) that have the same meaning in every measure space 
that models a family of events or random variables. And indeed, just 
as the natural numbers can be defined abstractly without reference 
to any numeral system (e.g. by the Peano axioms), core concepts of 
probability theory, such as random variables, can also be defined ab- 
stractly, without explicit mention of a measure space; we will return 
to this point when we discuss free probability in Section 2.5. 






For now, though, we shall stick to the standard measure-theoretic
approach to probability theory. In this approach, we assume the
presence of an ambient sample space $\Omega$, which intuitively is supposed
to describe all the possible outcomes of all the sources of randomness
that one is studying. Mathematically, this sample space is a
probability space $\Omega = (\Omega, \mathcal{B}, \mathbf{P})$: a set $\Omega$, together with a
$\sigma$-algebra $\mathcal{B}$ of subsets of $\Omega$ (the elements of which we will
identify with the probabilistic concept of an event), and a
probability measure $\mathbf{P}$ on the space of events, i.e. an assignment
$E \mapsto \mathbf{P}(E)$ of a real number in $[0, 1]$ to every event $E$ (known as
the probability of that event), such that the whole space $\Omega$ has
probability 1, and such that $\mathbf{P}$ is countably additive.

Elements of the sample space $\Omega$ will be denoted $\omega$. However, for
reasons that will be explained shortly, we will try to avoid actually
referring to such elements unless absolutely required to.

If we were studying just a single random process, e.g. rolling
a single die, then one could choose a very simple sample space: in
this case, one could choose the finite set $\{1,\dots,6\}$, with the
discrete $\sigma$-algebra $2^{\{1,\dots,6\}} := \{A : A \subset \{1,\dots,6\}\}$ and the
uniform probability measure. But if one later wanted to also study
additional random processes (e.g. supposing one later wanted to roll a
second die, and then add the two resulting rolls), one would have to
change the sample space (e.g. to change it now to the product space
$\{1,\dots,6\} \times \{1,\dots,6\}$). If one was particularly well organised, one
could in principle work out in advance all of the random variables one
would ever want or need, and then specify the sample space
accordingly, before doing any actual probability theory. In practice,
though, it is far more convenient to add new sources of randomness on
the fly, if and when they are needed, and extend the sample space as
necessary. This point is often glossed over in introductory probability
texts, so let us spend a little time on it. We say that one probability
space $(\Omega', \mathcal{B}', \mathbf{P}')$ extends¹ another $(\Omega, \mathcal{B}, \mathbf{P})$ if there is a
surjective map $\pi : \Omega' \to \Omega$ which is measurable (i.e. $\pi^{-1}(E) \in \mathcal{B}'$
for every $E \in \mathcal{B}$) and probability preserving (i.e.
$\mathbf{P}'(\pi^{-1}(E)) = \mathbf{P}(E)$ for every $E \in \mathcal{B}$). By definition, every event $E$
in the original probability space is canonically identified with an
event $\pi^{-1}(E)$ of the same probability in the extension.

¹ Strictly speaking, it is the pair $((\Omega', \mathcal{B}', \mathbf{P}'), \pi)$ which is the
extension of $(\Omega, \mathcal{B}, \mathbf{P})$, not just the space $(\Omega', \mathcal{B}', \mathbf{P}')$, but let us
abuse notation slightly here.

Example 1.1.1. As mentioned earlier, the sample space $\{1,\dots,6\}$,
that models the roll of a single die, can be extended to the sample
space $\{1,\dots,6\} \times \{1,\dots,6\}$ that models the roll of the original die
together with a new die, with the projection map
$\pi : \{1,\dots,6\} \times \{1,\dots,6\} \to \{1,\dots,6\}$ being given by $\pi(x, y) := x$.

Another example of an extension map is that of a permutation:
for instance, replacing the sample space $\{1,\dots,6\}$ by the isomorphic
space $\{a,\dots,f\}$ by mapping $a$ to 1, etc. This extension is not actually
adding any new sources of randomness, but is merely reorganising the 
existing randomness present in the sample space. 
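As a minimal sketch of the definition of an extension, the following Python snippet models the one-die and two-dice sample spaces from Example 1.1.1 as finite sets with uniform measures, and checks by brute force that the projection is surjective and probability preserving on every event. The names `pullback` and `measure` are illustrative helpers introduced here, not notation from the text.

```python
from fractions import Fraction
from itertools import product

# Original sample space: one fair die with the uniform measure.
omega = list(range(1, 7))
prob = {w: Fraction(1, 6) for w in omega}

# Extended sample space: two fair dice with the uniform (product) measure.
omega_ext = list(product(range(1, 7), repeat=2))
prob_ext = {w: Fraction(1, 36) for w in omega_ext}

def pi(w_ext):
    """Projection (x, y) -> x from the extension back to the original space."""
    return w_ext[0]

def pullback(event):
    """Identify an event E of the original space with its preimage under pi."""
    return {w for w in omega_ext if pi(w) in event}

def measure(event, p):
    """Probability of an event (a set of outcomes) under the measure p."""
    return sum(p[w] for w in event)

# Surjectivity of pi.
is_surjective = {pi(w) for w in omega_ext} == set(omega)

# All 2^6 events of the original space (the discrete sigma-algebra).
all_events = [
    {w for w, keep in zip(omega, mask) if keep}
    for mask in product([False, True], repeat=6)
]

# Probability preservation: P'(pi^{-1}(E)) = P(E) for every event E.
is_preserving = all(
    measure(pullback(E), prob_ext) == measure(E, prob) for E in all_events
)

print(is_surjective, is_preserving)
```

Exact rational arithmetic (`Fraction`) is used so that the equalities are checked exactly rather than up to floating-point error.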

In order to have the freedom to perform extensions every time we 
need to introduce a new source of randomness, we will try to adhere 
to the following important dogma²: probability theory is only
"allowed" to study concepts and perform operations which 
are preserved with respect to extension of the underlying 
sample space. As long as one is adhering strictly to this dogma, 
one can insert as many new sources of randomness (or reorganise 
existing sources of randomness) as one pleases; but if one deviates 
from this dogma and uses specific properties of a single sample space, 
then one has left the category of probability theory and must now 
take care when doing any subsequent operation that could alter that 
sample space. This dogma is an important aspect of the probabilistic 
way of thinking, much as the insistence on studying concepts and 
performing operations that are invariant with respect to coordinate 
changes or other symmetries is an important aspect of the modern 
geometric way of thinking. With this probabilistic viewpoint, we shall 
soon see the sample space essentially disappear from view altogether, 
after a few foundational issues are dispensed with. 



² This is analogous to how differential geometry is only "allowed" to
study concepts and perform operations that are preserved with respect
to coordinate change, or how graph theory is only "allowed" to study
concepts and perform operations that are preserved with respect to
relabeling of the vertices, etc.

Let us now give some simple examples of what is and what is
not a probabilistic concept or operation. The probability $\mathbf{P}(E)$ of
an event is a probabilistic concept; it is preserved under extensions.
Similarly, boolean operations on events such as union, intersection, 
and complement are also preserved under extensions and are thus 
also probabilistic operations. The emptiness or non-emptiness of an 
event $E$ is also probabilistic, as is the equality or non-equality³ of two
events $E$, $F$. On the other hand, the cardinality of an event is not a
probabilistic concept; for instance, the event that the roll of a given
die gives 4 has cardinality one in the sample space $\{1,\dots,6\}$, but
has cardinality six in the sample space $\{1,\dots,6\} \times \{1,\dots,6\}$ when
the values of an additional die are used to extend the sample space.
Thus, in the probabilistic way of thinking, one should avoid thinking 
about events as having cardinality, except to the extent that they are 
either empty or non-empty. 
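The distinction between probabilistic and non-probabilistic concepts can be checked concretely on the die example. The sketch below (with the hypothetical helper names `pullback` and `uniform_prob`) verifies that probability, Boolean operations, and emptiness survive the passage from one die to two dice, while cardinality does not.

```python
from fractions import Fraction
from itertools import product

one_die = set(range(1, 7))
two_dice = set(product(range(1, 7), repeat=2))

def pullback(event):
    """Preimage of a one-die event under the projection (x, y) -> x."""
    return {(x, y) for (x, y) in two_dice if x in event}

def uniform_prob(event, space):
    """Probability of an event under the uniform measure on a finite space."""
    return Fraction(len(event), len(space))

E = {4}          # "the die shows 4"
F = {2, 4, 6}    # "the die shows an even number"

# Probability and Boolean operations are preserved by the extension ...
same_prob = uniform_prob(pullback(E), two_dice) == uniform_prob(E, one_die)
same_union = pullback(E | F) == pullback(E) | pullback(F)
same_intersection = pullback(E & F) == pullback(E) & pullback(F)
same_complement = pullback(one_die - E) == two_dice - pullback(E)
empty_stays_empty = pullback(set()) == set()

# ... but cardinality is not.
card_before = len(E)
card_after = len(pullback(E))
print(card_before, card_after)
```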

Indeed, once one is no longer working at the foundational level, 
it is best to try to suppress the fact that events are being modeled as 
sets altogether. To assist in this, we will choose notation that avoids 
explicit use of set theoretic notation. For instance, the union of two 
events $E$, $F$ will be denoted $E \vee F$ rather than $E \cup F$, and will often
be referred to by a phrase such as "the event that at least one of $E$
or $F$ holds". Similarly, the intersection $E \cap F$ will instead be denoted
$E \wedge F$, or "the event that $E$ and $F$ both hold", and the complement
$\Omega \setminus E$ will instead be denoted $\overline{E}$, or "the event that $E$ does not hold"
or "the event that $E$ fails". In particular the sure event can now be
referred to without any explicit mention of the sample space as
$\overline{\emptyset}$. We will continue to use the subset notation $E \subset F$ (since the
notation $E \le F$ may cause confusion), but refer to this statement as
"$E$ is contained in $F$" or "$E$ implies $F$" or "$E$ holds only if $F$ holds"
rather than "$E$ is a subset of $F$", again to downplay the role of set
theory in modeling these events.

³ Note how it was important here that we demanded the map $\pi$ to be
surjective in the definition of an extension.

We record the trivial but fundamental union bound

(1.1) $\mathbf{P}(\bigvee_i E_i) \le \sum_i \mathbf{P}(E_i)$

for any finite or countably infinite collection of events $E_i$. Taking
complements, we see that if each event $E_i$ fails with probability at
most $\varepsilon_i$, then the joint event $\bigwedge_i E_i$ fails with probability at most
$\sum_i \varepsilon_i$. Thus, if one wants to ensure that all the events $E_i$ hold at once
with a reasonable probability, one can try to do this by showing that
the failure rate of each individual $E_i$ is small compared to the
reciprocal of the number of events one is controlling. This is a
reasonably efficient strategy so long as one expects the events $E_i$ to
be genuinely "different" from each other; if there are plenty of
repetitions, then the union bound is poor (consider for instance the
extreme case when $E_i$ does not even depend on $i$).
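To make the efficiency remark concrete, here is a small numerical sketch comparing the union bound with the true failure probability in two extreme situations: events that are assumed to be independent (so that the probability that all of them hold is the product of the individual probabilities), and events that all coincide. The failure rate 1/1000 and the count of 100 events are arbitrary illustrative choices.

```python
from fractions import Fraction

def union_bound(failure_probs):
    """Sum of the individual failure probabilities, as in the union bound."""
    return sum(failure_probs, Fraction(0))

def exact_failure_independent(failure_probs):
    """Exact P(at least one event fails), ASSUMING the events are independent."""
    all_hold = Fraction(1)
    for eps in failure_probs:
        all_hold *= 1 - eps
    return 1 - all_hold

eps = Fraction(1, 1000)   # failure probability of each event (illustrative)
k = 100                   # number of events (illustrative)
failures = [eps] * k

bound = union_bound(failures)
independent = exact_failure_independent(failures)
identical = eps           # if every E_i is the same event, only one can fail

print(float(bound), float(independent), float(identical))
```

In the independent case the bound loses only a few percent, while in the fully repeated case it overestimates the failure probability by the full factor of `k`.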

We will sometimes refer to use of the union bound to bound 
probabilities as the zeroth moment method, to contrast it with the 
first moment method, second moment method, exponential moment 
method, and Fourier moment methods for bounding probabilities that 
we will encounter later in this course. 

Let us formalise some specific cases of the union bound that we 
will use frequently in the course. In most of this course, there will be 
an integer parameter n, which will often be going off to infinity, and 
upon which most other quantities will depend; for instance, we will 
often be considering the spectral properties of $n \times n$ random matrices.

Definition 1.1.2 (Asymptotic notation). We use $X = O(Y)$, $Y = \Omega(X)$,
$X \ll Y$, or $Y \gg X$ to denote the estimate $|X| \le CY$ for
some $C$ independent of $n$ and all $n \ge C$. If we need $C$ to depend on
a parameter, e.g. $C = C_k$, we will indicate this by subscripts, e.g.
$X = O_k(Y)$. We write $X = o(Y)$ if $|X| \le c(n) Y$ for some $c$ that goes
to zero as $n \to \infty$. We write $X \sim Y$ or $X = \Theta(Y)$ if $X \ll Y \ll X$.

Given an event $E = E_n$ depending on such a parameter $n$, we
have five notions (in decreasing order of confidence) that an event is
likely to hold:

(i) An event $E$ holds surely (or is true) if it is equal to the sure
event $\overline{\emptyset}$.

(ii) An event $E$ holds almost surely (or with full probability) if
it occurs with probability 1: $\mathbf{P}(E) = 1$.

(iii) An event $E$ holds with overwhelming probability if, for every
fixed $A > 0$, it holds with probability $1 - O_A(n^{-A})$ (i.e. one
has $\mathbf{P}(E) \ge 1 - C_A n^{-A}$ for some $C_A$ independent of $n$).

(iv) An event $E$ holds with high probability if it holds with
probability $1 - O(n^{-c})$ for some $c > 0$ independent of $n$ (i.e. one
has $\mathbf{P}(E) \ge 1 - C n^{-c}$ for some $C$ independent of $n$).

(v) An event $E$ holds asymptotically almost surely if it holds
with probability $1 - o(1)$, thus the probability of success
goes to 1 in the limit $n \to \infty$.

Of course, all of these notions are probabilistic notions. 

Given a family of events $E_\alpha$ depending on some parameter $\alpha$, we
say that each event in the family holds with overwhelming probability
uniformly in $\alpha$ if the constant $C_A$ in the definition of overwhelming
probability is independent of $\alpha$; one can similarly define uniformity
in the concepts of holding with high probability or asymptotic almost
sure probability.

From the union bound (1.1) we immediately have

Lemma 1.1.3 (Union bound).

(i) If $E_\alpha$ is an arbitrary family of events that each hold surely,
then $\bigwedge_\alpha E_\alpha$ holds surely.

(ii) If $E_\alpha$ is an at most countable family of events that each hold
almost surely, then $\bigwedge_\alpha E_\alpha$ holds almost surely.

(iii) If $E_\alpha$ is a family of events of polynomial cardinality (i.e.
cardinality $O(n^{O(1)})$) which hold with uniformly overwhelming
probability, then $\bigwedge_\alpha E_\alpha$ holds with overwhelming probability.

(iv) If $E_\alpha$ is a family of events of sub-polynomial cardinality (i.e.
cardinality $O(n^{o(1)})$) which hold with uniformly high probability,
then $\bigwedge_\alpha E_\alpha$ holds with high probability. (In particular,
the cardinality can be polylogarithmic in size, $O(\log^{O(1)} n)$.)

(v) If $E_\alpha$ is a family of events of uniformly bounded cardinality
(i.e. cardinality $O(1)$) which each hold asymptotically almost
surely, then $\bigwedge_\alpha E_\alpha$ holds asymptotically almost surely.
(Note that uniformity of asymptotic almost sureness is automatic
when the cardinality is bounded.)

Note how as the certainty of an event gets stronger, the num- 
ber of times one can apply the union bound increases. In particular, 
holding with overwhelming probability is practically as good as hold- 
ing surely or almost surely in many of our applications (except when 
one has to deal with the entropy of an n-dimensional system, which 
can be exponentially large, and will thus require a certain amount of 
caution). 
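The trade-off between the strength of each event and the number of events one can afford may be illustrated numerically. The sketch below uses two hypothetical failure rates, chosen only for illustration: a per-event failure probability of $e^{-n}$ (which is smaller than any fixed power of $1/n$) across $n^3$ events, and a per-event failure probability of $n^{-1/2}$ across $n$ events.

```python
import math

def total_failure_bound(num_events, per_event_failure):
    """Union bound on the probability that at least one event fails."""
    return num_events * per_event_failure

ns = [10, 100, 1000]

# n^3 events, each failing with probability exp(-n): the total failure
# bound still decays faster than any fixed power of 1/n.
overwhelming = [total_failure_bound(n**3, math.exp(-n)) for n in ns]

# n events, each failing with probability n^(-1/2): the union bound
# exceeds 1 and therefore gives no information.
high = [total_failure_bound(n, n**-0.5) for n in ns]

for n, a, b in zip(ns, overwhelming, high):
    print(n, a, b)
```

The second family has polynomial rather than sub-polynomial cardinality, which is why the union bound breaks down there.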

1.1.2. Random variables. An event E can be in just one of two 
states: the event can hold or fail, with some probability assigned to 
each. But we will usually need to consider the more general class of 
random variables which can be in multiple states. 

Definition 1.1.4 (Random variable). Let $R = (R, \mathcal{R})$ be a measurable
space (i.e. a set $R$, equipped with a $\sigma$-algebra $\mathcal{R}$ of subsets of $R$). A
random variable taking values in $R$ (or an $R$-valued random variable)
is a measurable map $X$ from the sample space to $R$, i.e. a function
$X : \Omega \to R$ such that $X^{-1}(S)$ is an event for every $S \in \mathcal{R}$.

As the notion of a random variable involves the sample space,
one has to pause to check that it is invariant under extensions before
one can assert that it is a probabilistic concept. But this is clear: if
$X : \Omega \to R$ is a random variable, and $\pi : \Omega' \to \Omega$ is an extension of
$\Omega$, then $X' := X \circ \pi$ is also a random variable, which generates the
same events in the sense that $(X')^{-1}(S) = \pi^{-1}(X^{-1}(S))$ for every
$S \in \mathcal{R}$.

At this point let us make the convenient convention (which we 
have in fact been implicitly using already) that an event is identified 
with the predicate which is true on the event set and false outside of 
the event set. Thus for instance the event $X^{-1}(S)$ could be identified
with the predicate "$X \in S$"; this is preferable to the set-theoretic
notation $\{\omega \in \Omega : X(\omega) \in S\}$, as it does not require explicit reference
to the sample space and is thus more obviously a probabilistic notion.
We will often omit the quotes when it is safe to do so, for instance
$\mathbf{P}(X \in S)$ is shorthand for $\mathbf{P}($"$X \in S$"$)$.

Remark 1.1.5. On occasion, we will have to deal with almost surely
defined random variables, which are only defined on a subset $\Omega'$ of $\Omega$
of full probability. However, much as measure theory and integration
theory are largely unaffected by modification on sets of measure zero,
many probabilistic concepts, in particular probability, distribution,
and expectation, are similarly unaffected by modification on events of
probability zero. Thus, a lack of definedness on an event of probability
zero will usually not cause difficulty, so long as there are at most
countably many such events in which one of the probabilistic objects
being studied is undefined. In such cases, one can usually resolve such
issues by setting a random variable to some arbitrary value (e.g. 0)
whenever it would otherwise be undefined.

We observe a few key subclasses and examples of random vari- 
ables: 

(i) Discrete random variables, in which $\mathcal{R} = 2^R$ is the discrete
$\sigma$-algebra, and $R$ is at most countable. Typical examples
of $R$ include a countable subset of the reals or complexes,
such as the natural numbers or integers. If $R = \{0, 1\}$,
we say that the random variable is Boolean, while if $R$ is
just a singleton set $\{c\}$ we say that the random variable is
deterministic, and (by abuse of notation) we identify this
random variable with $c$ itself. Note that a Boolean random
variable is nothing more than an indicator function $1(E)$ of
an event $E$, where $E$ is the event that the Boolean function
equals 1.

(ii) Real-valued random variables, in which $R$ is the real line $\mathbf{R}$
and $\mathcal{R}$ is the Borel $\sigma$-algebra, generated by the open sets of
$\mathbf{R}$. Thus for any real-valued random variable $X$ and any
interval $I$, we have the events "$X \in I$". In particular, we
have the upper tail event "$X \ge \lambda$" and lower tail event
"$X \le \lambda$" for any threshold $\lambda$. (We also consider the events
"$X > \lambda$" and "$X < \lambda$" to be tail events; in practice, there
is very little distinction between the two.)

(iii) Complex random variables, whose range is the complex plane
$\mathbf{C}$ with the Borel $\sigma$-algebra. A typical event associated
to a complex random variable $X$ is the small ball event
"$|X - z| < r$" for some complex number $z$ and some (small)
radius $r > 0$. We refer to real and complex random variables
collectively as scalar random variables.

(iv) Given an $R$-valued random variable $X$, and a measurable map
$f : R \to R'$, the $R'$-valued random variable $f(X)$ is indeed
a random variable, and the operation of converting $X$ to
$f(X)$ is preserved under extension of the sample space and
is thus probabilistic. This variable $f(X)$ can also be defined
without reference to the sample space as the unique random
variable for which the identity

"$f(X) \in S$" = "$X \in f^{-1}(S)$"

holds for all $R'$-measurable sets $S$.

(v) Given two random variables $X_1$ and $X_2$ taking values in
$R_1$, $R_2$ respectively, one can form the joint random variable
$(X_1, X_2)$ with range $R_1 \times R_2$ with the product $\sigma$-algebra, by
setting $(X_1, X_2)(\omega) := (X_1(\omega), X_2(\omega))$ for every $\omega \in \Omega$. One
easily verifies that this is indeed a random variable, and that
the operation of taking a joint random variable is a
probabilistic operation. This variable can also be defined without
reference to the sample space as the unique random variable
for which one has $\pi_1(X_1, X_2) = X_1$ and $\pi_2(X_1, X_2) = X_2$,
where $\pi_1 : (x_1, x_2) \mapsto x_1$ and $\pi_2 : (x_1, x_2) \mapsto x_2$ are the
usual projection maps from $R_1 \times R_2$ to $R_1$, $R_2$ respectively.
One can similarly define the joint random variable $(X_\alpha)_{\alpha \in A}$
for any family of random variables $X_\alpha$ in various ranges $R_\alpha$
(note here that the set $A$ of labels can be infinite or even
uncountable).

(vi) Combining the previous two constructions, given any measurable
binary operation $f : R_1 \times R_2 \to R'$ and random variables
$X_1$, $X_2$ taking values in $R_1$, $R_2$ respectively, one can
form the $R'$-valued random variable $f(X_1, X_2) := f((X_1, X_2))$,
and this is a probabilistic operation. Thus for instance one
can add or multiply together scalar random variables, and
similarly for the matrix-valued random variables that we
will consider shortly. Similarly for ternary and higher order
operations. A technical issue arises if one wants to perform
an operation (such as division of two scalar random variables)
which is not defined everywhere (e.g. division when
the denominator is zero): in such cases, one has to adjoin
an additional "undefined" symbol $\bot$ to the output range $R'$.
In practice, this will not be a problem as long as all random
variables concerned are defined (i.e. avoid $\bot$) almost surely.

(vii) Vector-valued random variables, which take values in a
finite-dimensional vector space such as $\mathbf{R}^n$ or $\mathbf{C}^n$ with the Borel
$\sigma$-algebra. One can view a vector-valued random variable
$X = (X_1, \dots, X_n)$ as the joint random variable of its scalar
component random variables $X_1, \dots, X_n$.

(viii) Matrix-valued random variables or random matrices, which
take values in a space $M_{n \times p}(\mathbf{R})$ or $M_{n \times p}(\mathbf{C})$ of $n \times p$ real
or complex-valued matrices, again with the Borel $\sigma$-algebra,
where $n, p \ge 1$ are integers (usually we will focus on the
square case $n = p$). Note here that the shape $n \times p$ of
the matrix is deterministic; we will not consider in this
course matrices whose shapes are themselves random variables.
One can view a matrix-valued random variable
$X = (X_{ij})_{1 \le i \le n; 1 \le j \le p}$ as the joint random variable of its scalar
components $X_{ij}$. One can apply all the usual matrix operations
(e.g. sum, product, determinant, trace, inverse, etc.)
on random matrices to get a random variable with the appropriate
range, though in some cases (e.g. with inverse) one
has to adjoin the undefined symbol $\bot$ as mentioned earlier.

(ix) Point processes, which take values in the space $\mathfrak{N}(S)$ of
subsets $A$ of a space $S$ (or more precisely, on the space of
multisets of $S$, or even more precisely still as integer-valued locally
finite measures on $S$), with the $\sigma$-algebra being generated by
the counting functions $|A \cap B|$ for all precompact measurable
sets $B$. Thus, if $X$ is a point process in $S$, and $B$ is a
precompact measurable set, then the counting function $|X \cap B|$
is a discrete random variable in $\{0, 1, 2, \dots\} \cup \{+\infty\}$. For us,
the key example of a point process comes from taking the
spectrum $\{\lambda_1, \dots, \lambda_n\}$ of eigenvalues (counting multiplicity)
of a random $n \times n$ matrix $M_n$. Point processes are discussed
further in [Ta2010b, §2.6]. We will return to point
processes (and define them more formally) later in this text.
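The constructions in items (iv) to (vi) can be sketched on a small finite sample space, with random variables modelled as ordinary functions of the outcome. This is only an illustration under simplifying assumptions: the sample space is two fair coin flips, the helper names `compose`, `joint`, `divide` and `P` are hypothetical, and Python's `None` stands in for the adjoined "undefined" symbol.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coins with values in {0, 1}, uniform measure.
omega = list(product([0, 1], repeat=2))
prob = {w: Fraction(1, 4) for w in omega}

UNDEFINED = None  # stands in for the adjoined "undefined" symbol

# Random variables are modelled as plain functions on the sample space.
def X1(w):
    return w[0]

def X2(w):
    return w[1]

def compose(f, X):
    """The random variable f(X), defined pointwise as f(X(w))."""
    return lambda w: f(X(w))

def joint(X, Y):
    """The joint random variable (X, Y)."""
    return lambda w: (X(w), Y(w))

def divide(pair):
    """Division, with the undefined symbol adjoined to the output range."""
    x, y = pair
    return UNDEFINED if y == 0 else Fraction(x, y)

def P(predicate):
    """Probability of the event described by a predicate on outcomes."""
    return sum(p for w, p in prob.items() if predicate(w))

total = compose(sum, joint(X1, X2))      # the random variable X1 + X2
ratio = compose(divide, joint(X1, X2))   # the random variable X1 / X2

p_total_is_1 = P(lambda w: total(w) == 1)
p_ratio_undefined = P(lambda w: ratio(w) is UNDEFINED)
print(p_total_is_1, p_ratio_undefined)
```

Here the ratio is undefined with probability 1/2, so this is a case where the undefined symbol genuinely matters; the convention is harmless only when that probability is zero.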

Remark 1.1.6. A pedantic point: strictly speaking, one has to include
the range $R = (R, \mathcal{R})$ of a random variable $X$ as part of that
variable (thus one should really be referring to the pair $(X, R)$ rather
than $X$). This leads to the annoying conclusion that, technically,
Boolean random variables are not integer-valued, integer-valued random
variables are not real-valued, and real-valued random variables
are not complex-valued. To avoid this issue we shall abuse notation
very slightly and identify any random variable $X = (X, R)$ with any
coextension $(X, R')$ of that random variable to a larger range space
$R' \supset R$ (assuming of course that the $\sigma$-algebras are compatible).
Thus, for instance, a real-valued random variable which happens to
only take a countable number of values will now be considered a
discrete random variable also.

Given a random variable $X$ taking values in some range $R$, we
define the distribution $\mu_X$ of $X$ to be the probability measure on the
measurable space $R = (R, \mathcal{R})$ defined by the formula

(1.2) $\mu_X(S) := \mathbf{P}(X \in S),$

thus $\mu_X$ is the pushforward $X_* \mathbf{P}$ of the sample space probability
measure $\mathbf{P}$ by $X$. This is easily seen to be a probability measure, and
is also a probabilistic concept. The probability measure $\mu_X$ is also
known as the law for $X$.

We write $X \equiv Y$ for $\mu_X = \mu_Y$; we also abuse notation slightly
by writing $X \equiv \mu_X$.
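For a finite sample space the pushforward in (1.2) can be computed directly by summing the probabilities of all outcomes that map to a given value. The following sketch does this for two dice; the function name `law` is an illustrative choice. It also shows that two random variables can have the same distribution without being equal.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: two fair dice with the uniform measure.
omega = list(product(range(1, 7), repeat=2))
prob = {w: Fraction(1, 36) for w in omega}

def law(X):
    """Distribution of X as a dict mapping each value x to P(X = x)."""
    mu = defaultdict(Fraction)
    for w, p in prob.items():
        mu[X(w)] += p
    return dict(mu)

def first(w):
    return w[0]

def second(w):
    return w[1]

def total(w):
    return w[0] + w[1]

mu_first, mu_second, mu_total = law(first), law(second), law(total)

# Same law, but not the same random variable.
same_law = mu_first == mu_second
differ_somewhere = any(first(w) != second(w) for w in omega)
print(same_law, differ_somewhere, mu_total[7])
```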

We have seen that every random variable generates a probability
distribution $\mu_X$. The converse is also true:

Lemma 1.1.7 (Creating a random variable with a specified distribution).
Let $\mu$ be a probability measure on a measurable space
$R = (R, \mathcal{R})$. Then (after extending the sample space $\Omega$ if necessary)
there exists an $R$-valued random variable $X$ with distribution $\mu$.

Proof. Extend $\Omega$ to $\Omega \times R$ by using the obvious projection map
$(\omega, r) \mapsto \omega$ from $\Omega \times R$ back to $\Omega$, and extending the probability
measure $\mathbf{P}$ on $\Omega$ to the product measure $\mathbf{P} \times \mu$ on $\Omega \times R$. The
random variable $X(\omega, r) := r$ then has distribution $\mu$. □
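The construction in the proof can be carried out explicitly when both the sample space and the range are finite. In the sketch below, the initial sample space (a single fair coin) and the target distribution on $\{0, 1, 2\}$ are hypothetical choices made purely for illustration.

```python
from fractions import Fraction
from itertools import product

# Existing sample space: one fair coin.
omega = ["heads", "tails"]
P = {w: Fraction(1, 2) for w in omega}

# Target distribution mu on the range R = {0, 1, 2} (illustrative choice).
mu = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}

# Extend Omega to Omega x R, equipped with the product measure P x mu.
omega_ext = list(product(omega, mu))
P_ext = {(w, r): P[w] * mu[r] for (w, r) in omega_ext}

def X(w_ext):
    """The new random variable X(omega, r) := r."""
    return w_ext[1]

def projection(w_ext):
    """The extension map (omega, r) -> omega."""
    return w_ext[0]

# X has distribution mu on the extended space ...
law_of_X = {r: sum(p for w, p in P_ext.items() if X(w) == r) for r in mu}

# ... and the projection is probability preserving, so the coin is unchanged.
coin_marginal = {
    c: sum(p for w, p in P_ext.items() if projection(w) == c) for c in omega
}
print(law_of_X == mu, coin_marginal == P)
```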

If $X$ is a discrete random variable, $\mu_X$ is the discrete probability
measure

(1.3) $\mu_X(S) = \sum_{x \in S} p_x$

where $p_x := \mathbf{P}(X = x)$ are non-negative real numbers that add up
to 1. To put it another way, the distribution of a discrete random
variable can be expressed as the sum of Dirac masses (defined below):

(1.4) $\mu_X = \sum_{x \in R} p_x \delta_x.$

We list some important examples of discrete distributions: 

(i) Dirac distributions $\delta_{x_0}$, in which $p_x = 1$ for $x = x_0$ and
$p_x = 0$ otherwise;

(ii) discrete uniform distributions, in which $R$ is finite and
$p_x = 1/|R|$ for all $x \in R$;

(iii) (Unsigned) Bernoulli distributions, in which $R = \{0, 1\}$,
$p_1 = p$, and $p_0 = 1 - p$ for some parameter $0 \le p \le 1$;

(iv) The signed Bernoulli distribution, in which $R = \{-1, +1\}$
and $p_{+1} = p_{-1} = 1/2$;

(v) Lazy signed Bernoulli distributions, in which $R = \{-1, 0, +1\}$,
$p_{+1} = p_{-1} = \mu/2$, and $p_0 = 1 - \mu$ for some parameter
$0 \le \mu \le 1$;

(vi) Geometric distributions, in which $R = \{0, 1, 2, \dots\}$ and
$p_k = (1 - p)^k p$ for all natural numbers $k$ and some parameter
$0 < p \le 1$; and

(vii) Poisson distributions, in which $R = \{0, 1, 2, \dots\}$ and
$p_k = \frac{\lambda^k e^{-\lambda}}{k!}$ for all natural numbers $k$ and some parameter $\lambda$.
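As a quick sanity check on the formulas in this list, the following sketch sums the geometric, Poisson and lazy signed Bernoulli weights numerically and confirms that each family has total mass 1 up to rounding. The parameter values and the truncation point `K` are arbitrary illustrative choices; the infinite sums are truncated where the remaining tail is far below floating-point precision.

```python
import math

def geometric_pmf(k, p):
    """Weight (1 - p)^k p of the geometric distribution at k = 0, 1, 2, ..."""
    return (1 - p) ** k * p

def poisson_pmf(k, lam):
    """Weight lam^k e^{-lam} / k! of the Poisson distribution at k."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def lazy_signed_bernoulli_pmf(mu):
    """Weights of the lazy signed Bernoulli distribution on {-1, 0, +1}."""
    return {-1: mu / 2, 0: 1 - mu, +1: mu / 2}

K = 100  # truncation point for the infinite sums (illustrative)

geometric_mass = sum(geometric_pmf(k, 0.3) for k in range(K))
poisson_mass = sum(poisson_pmf(k, 4.0) for k in range(K))
lazy_mass = sum(lazy_signed_bernoulli_pmf(0.25).values())

print(geometric_mass, poisson_mass, lazy_mass)
```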

Now we turn to non-discrete random variables $X$ taking values
in some range $R$. We say that a random variable is continuous if
$\mathbf{P}(X = x) = 0$ for all $x \in R$ (here we assume that all points are
measurable). If $R$ is already equipped with some reference measure
$dm$ (e.g. Lebesgue measure in the case of scalar, vector, or
matrix-valued random variables), we say that the random variable is
absolutely continuous if $\mathbf{P}(X \in S) = 0$ for all null sets $S$ in $R$. By the
Radon-Nikodym theorem (see e.g. [Ta2010, §1.10]), we can thus find
a non-negative, absolutely integrable function $f \in L^1(R, dm)$ with
$\int_R f\, dm = 1$ such that

(1.5) $\mu_X(S) = \int_S f\, dm$

for all measurable sets $S \subset R$. More succinctly, one has

(1.6) $d\mu_X = f\, dm.$

We call $f$ the probability density function of the probability
distribution $\mu_X$ (and thus, of the random variable $X$). As usual in measure
theory, this function is only defined up to almost everywhere
equivalence, but this will not cause any difficulties.

In the case of real-valued random variables $X$, the distribution $\mu_X$
can also be described in terms of the cumulative distribution function

(1.7) $F_X(x) := \mathbf{P}(X \le x) = \mu_X((-\infty, x]).$

Indeed, $\mu_X$ is the Lebesgue-Stieltjes measure of $F_X$, and (in the
absolutely continuous case) the derivative of $F_X$ exists and is equal to
the probability density function almost everywhere. We will not use
the cumulative distribution function much in this text, although we
will be very interested in bounding tail events such as $\mathbf{P}(X \ge \lambda)$ or
$\mathbf{P}(X \le \lambda)$.

We give some basic examples of absolutely continuous scalar dis- 
tributions: 

(i) uniform distributions, in which $f := \frac{1}{m(I)} 1_I$ for some subset
$I$ of the reals or complexes of finite non-zero measure, e.g.
an interval $[a, b]$ in the real line, or a disk in the complex
plane.

(ii) The real normal distribution $N(\mu, \sigma^2) = N(\mu, \sigma^2)_{\mathbf{R}}$ of mean
$\mu \in \mathbf{R}$ and variance $\sigma^2 > 0$, given by the density function
$f(x) := \frac{1}{\sqrt{2\pi\sigma^2}} \exp(-(x - \mu)^2 / 2\sigma^2)$ for $x \in \mathbf{R}$. We isolate in
particular the standard (real) normal distribution $N(0, 1)$.
Random variables with normal distributions are known as
Gaussian random variables.

(iii) The complex normal distribution $N(\mu, \sigma^2)_{\mathbf{C}}$ of mean $\mu \in \mathbf{C}$
and variance $\sigma^2 > 0$, given by the density function
$f(z) := \frac{1}{\pi\sigma^2} \exp(-|z - \mu|^2 / \sigma^2)$. Again, we isolate the standard
complex normal distribution $N(0, 1)_{\mathbf{C}}$.
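The normalising constants of the real and complex normal densities differ, and a numerical integration is a simple way to confirm that each density has total mass 1 and variance $\sigma^2$. The sketch below uses a basic midpoint rule; the parameter values, the integration window, and the helper names are illustrative assumptions. For the complex density the integral is computed in polar coordinates around the mean (taken to be 0 here), and the variance is read as the mean of $|z - \mu|^2$.

```python
import math

def real_normal_density(x, mu, sigma2):
    """Density of the real normal distribution with mean mu, variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def complex_normal_density(z, mu, sigma2):
    """Density of the complex normal distribution with mean mu, variance sigma2."""
    return math.exp(-abs(z - mu) ** 2 / sigma2) / (math.pi * sigma2)

def midpoint_rule(g, a, b, steps=20000):
    """Approximate the integral of g over [a, b] by the midpoint rule."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

mu, sigma2 = 1.0, 4.0

real_mass = midpoint_rule(
    lambda x: real_normal_density(x, mu, sigma2), mu - 40, mu + 40)
real_var = midpoint_rule(
    lambda x: (x - mu) ** 2 * real_normal_density(x, mu, sigma2), mu - 40, mu + 40)

def radial(r, power):
    """Integrand r^power * density on the circle of radius r (length 2 pi r)."""
    return r ** power * complex_normal_density(complex(r, 0), 0, sigma2) * 2 * math.pi * r

complex_mass = midpoint_rule(lambda r: radial(r, 0), 0, 40)
complex_var = midpoint_rule(lambda r: radial(r, 2), 0, 40)

print(real_mass, real_var, complex_mass, complex_var)
```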

Later on, we will encounter several more scalar distributions of 
relevance to random matrix theory, such as the semicircular law or 
Marcenko-Pastur law. We will also of course encounter many ma- 
trix distributions (also known as matrix ensembles) as well as point 
processes. 

Given an unsigned random variable $X$ (i.e. a random variable
taking values in $[0, +\infty]$), one can define the expectation or mean $\mathbf{E}X$
as the unsigned integral

(1.8) $\mathbf{E}X := \int_0^\infty x\, d\mu_X(x),$

which by the Fubini-Tonelli theorem (see e.g. [Ta2011, §1.7]) can
also be rewritten as

(1.9) $\mathbf{E}X = \int_0^\infty \mathbf{P}(X \ge \lambda)\, d\lambda.$
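The agreement between the two formulas for the expectation can be checked exactly for a simple discrete example, here the roll of a fair die (an illustrative choice). For an integer-valued variable the tail probability is constant on each interval $(k-1, k]$, so the integral of the tail reduces to a finite sum; the function names below are hypothetical.

```python
from fractions import Fraction

# Law of a fair die: an unsigned, integer-valued random variable.
mu = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(mu):
    """Mean computed by integrating x against the law, as in (1.8)."""
    return sum(x * p for x, p in mu.items())

def tail(mu, lam):
    """Upper tail probability P(X >= lam)."""
    return sum(p for x, p in mu.items() if x >= lam)

def tail_integral(mu):
    """Integral of P(X >= lam) over lam > 0, as in (1.9).

    For integer-valued X the tail is constant on each interval (k-1, k],
    so the integral equals the sum of P(X >= k) over k = 1, 2, ...
    """
    return sum(tail(mu, k) for k in range(1, max(mu) + 1))

direct = expectation(mu)
via_tails = tail_integral(mu)
print(direct, via_tails)
```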

The expectation of an unsigned variable also lies in $[0, +\infty]$. If $X$ is
a scalar random variable (which is allowed to take the value $\infty$) for
which $\mathbf{E}|X| < \infty$, we say that $X$ is absolutely integrable, in which
case we can define its expectation as

(1.10) $\mathbf{E}X := \int_{\mathbf{R}} x \, d\mu_X(x)$

in the real case, or

(1.11) $\mathbf{E}X := \int_{\mathbf{C}} z \, d\mu_X(z)$

in the complex case. Similarly for vector-valued random variables
(note that in finite dimensions, all norms are equivalent, so the
precise choice of norm used to define $|X|$ is not relevant here). If
$X = (X_1, \dots, X_n)$ is a vector-valued random variable, then $X$ is absolutely
integrable if and only if the components $X_i$ are all absolutely
integrable, in which case one has $\mathbf{E}X = (\mathbf{E}X_1, \dots, \mathbf{E}X_n)$.

Examples 1.1.8. A deterministic scalar random variable $c$ is its
own mean. An indicator function $\mathbf{I}(E)$ has mean $\mathbf{P}(E)$. An unsigned
Bernoulli variable (as defined previously) has mean $p$, while a signed
or lazy signed Bernoulli variable has mean $0$. A real or complex
Gaussian variable with distribution $N(\mu, \sigma^2)$ has mean $\mu$. A Poisson
random variable has mean $\lambda$; a geometric random variable has mean
$p$. A uniformly distributed variable on an interval $[a, b] \subset \mathbf{R}$ has mean
$\frac{a+b}{2}$.

A fundamentally important property of expectation is that it is
linear: if $X_1, \dots, X_k$ are absolutely integrable scalar random
variables and $c_1, \dots, c_k$ are finite scalars, then $c_1 X_1 + \dots + c_k X_k$ is also
absolutely integrable and

(1.12) $\mathbf{E}(c_1 X_1 + \dots + c_k X_k) = c_1 \mathbf{E}X_1 + \dots + c_k \mathbf{E}X_k.$

By the Fubini-Tonelli theorem, the same result also applies to infinite
sums $\sum_{i=1}^\infty c_i X_i$ provided that $\sum_{i=1}^\infty |c_i| \mathbf{E}|X_i|$ is finite.

We will use linearity of expectation so frequently in the sequel
that we will often omit an explicit reference to it when it is being
used. It is important to note that linearity of expectation requires no
assumptions of independence or dependence amongst the individual
random variables $X_i$; this is what makes this property of expectation
so powerful.

In the unsigned (or real absolutely integrable) case, expectation is
also monotone: if $X \leq Y$ is true for some unsigned or real absolutely
integrable $X, Y$, then $\mathbf{E}X \leq \mathbf{E}Y$. Again, we will usually use this
basic property without explicitly mentioning it in the sequel.

For an unsigned random variable, we have the obvious but very
useful Markov inequality

(1.13) $\mathbf{P}(X \geq \lambda) \leq \frac{1}{\lambda} \mathbf{E}X$

for any $\lambda > 0$, as can be seen by taking expectations of the inequality
$\lambda \mathbf{I}(X \geq \lambda) \leq X$. For signed random variables, Markov's inequality
becomes

(1.14) $\mathbf{P}(|X| \geq \lambda) \leq \frac{1}{\lambda} \mathbf{E}|X|.$

Another fact related to Markov's inequality is that if $X$ is an
unsigned or real absolutely integrable random variable, then $X \geq \mathbf{E}X$
must hold with positive probability, and $X \leq \mathbf{E}X$ must also hold
with positive probability. Use of these facts or (1.13), (1.14),
combined with monotonicity and linearity of expectation, is collectively
referred to as the first moment method. This method tends to be
particularly easy to use (as one does not need to understand dependence
or independence), but by the same token often gives sub-optimal
results (as one is not exploiting any independence in the system).
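
The facts used by the first moment method can be checked exactly on a small example. The following sketch takes the value of a fair six-sided die as a toy unsigned random variable (an assumption made purely for illustration) and tests Markov's inequality (1.13) together with the observation that $X \geq \mathbf{E}X$ and $X \leq \mathbf{E}X$ each hold with positive probability.

```python
from fractions import Fraction

# Toy unsigned random variable: the value shown by a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(dist):
    """E X for a finitely supported distribution {value: probability}."""
    return sum(x * p for x, p in dist.items())

def prob(dist, event):
    """Probability that the value satisfies the predicate `event`."""
    return sum(p for x, p in dist.items() if event(x))

mean = expectation(die)

# Markov (1.13): P(X >= lam) <= E X / lam for each tested lam > 0.
markov_ok = all(
    prob(die, lambda x: x >= lam) <= mean / lam for lam in range(1, 10)
)

# X >= E X and X <= E X must each hold with positive probability.
above = prob(die, lambda x: x >= mean)
below = prob(die, lambda x: x <= mean)

print(mean, markov_ok, above, below)
```

Note that the bound in (1.13) is only informative once $\lambda$ exceeds the mean; for smaller $\lambda$ the right-hand side is larger than 1.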

Exercise 1.1.1 (Borel-Cantelli lemma). Let $E_1, E_2, \dots$ be a sequence
of events such that $\sum_i \mathbf{P}(E_i) < \infty$. Show that almost surely, at most
finitely many of the events $E_i$ occur at once. State and prove a result
to the effect that the condition $\sum_i \mathbf{P}(E_i) < \infty$ cannot be weakened.

If $X$ is an absolutely integrable or unsigned scalar random variable,
and $F$ is a measurable function from the scalars to the unsigned
extended reals $[0, +\infty]$, then one has the change of variables formula

(1.15) $\mathbf{E}F(X) = \int_{\mathbf{R}} F(x) \, d\mu_X(x)$

when $X$ is real-valued and

(1.16) $\mathbf{E}F(X) = \int_{\mathbf{C}} F(z) \, d\mu_X(z)$

when $X$ is complex-valued. The same formula applies to signed or
complex $F$ if it is known that $|F(X)|$ is absolutely integrable.
Important examples of expressions such as $\mathbf{E}F(X)$ are moments

(1.17) $\mathbf{E}|X|^k$

for various $k \geq 1$ (particularly $k = 1, 2, 4$), exponential moments

(1.18) $\mathbf{E}e^{tX}$

for real $t, X$, and Fourier moments (or the characteristic function)

(1.19) $\mathbf{E}e^{itX}$

for real $t, X$, or

(1.20) $\mathbf{E}e^{it \cdot X}$

for complex or vector-valued $t, X$, where $\cdot$ denotes a real inner
product. We shall also occasionally encounter the resolvents

(1.21) $\mathbf{E}\frac{1}{X - z}$

for complex $z$, though one has to be careful now with the absolute
convergence of this random variable. Similarly, we shall also
occasionally encounter negative moments $\mathbf{E}|X|^{-k}$ of $X$, particularly for $k = 2$.
We also sometimes use the zeroth moment $\mathbf{E}|X|^0 = \mathbf{P}(X \neq 0)$, where
we take the somewhat unusual convention that $x^0 := \lim_{k \to 0^+} x^k$ for
non-negative $x$, thus $x^0 := 1$ for $x > 0$ and $0^0 := 0$. Thus, for
instance, the union bound (1.1) can be rewritten (for finitely many $i$,
at least) as

(1.22) $\mathbf{E}|\sum_i c_i X_i|^0 \leq \sum_i |c_i|^0 \mathbf{E}|X_i|^0$

for any scalar random variables $X_i$ and scalars $c_i$ (compare with
(1.12)).

It will be important to know if a scalar random variable $X$ is
"usually bounded". We have several ways of quantifying this, in
decreasing order of strength:

(i) $X$ is surely bounded if there exists an $M > 0$ such that
$|X| \leq M$ surely.

(ii) $X$ is almost surely bounded if there exists an $M > 0$ such
that $|X| \leq M$ almost surely.

(iii) $X$ is subgaussian if there exist $C, c > 0$ such that
$\mathbf{P}(|X| \geq \lambda) \leq C \exp(-c\lambda^2)$ for all $\lambda > 0$.

(iv) $X$ has sub-exponential tail if there exist $C, c, a > 0$ such that
$\mathbf{P}(|X| \geq \lambda) \leq C \exp(-c\lambda^a)$ for all $\lambda > 0$.

(v) $X$ has finite $k^{\mathrm{th}}$ moment for some $k \geq 1$ if there exists $C$
such that $\mathbf{E}|X|^k \leq C$.

(vi) $X$ is absolutely integrable if $\mathbf{E}|X| < \infty$.

(vii) $X$ is almost surely finite if $|X| < \infty$ almost surely.

Exercise 1.1.2. Show that these properties genuinely are in decreas- 
ing order of strength, i.e. that each property on the list implies the 
next. 






Exercise 1.1.3. Show that each of these properties is closed under
vector space operations, thus for instance if $X, Y$ have sub-exponential
tail, show that $X + Y$ and $cX$ also have sub-exponential tail for any
scalar $c$.

Examples 1.1.9. The various species of Bernoulli random variable
are surely bounded, and any random variable which is uniformly
distributed in a bounded set is almost surely bounded. Gaussians
are subgaussian, while the Poisson and geometric distributions
merely have sub-exponential tail (the Poisson tail decays like
$\exp(-\lambda \log \lambda)$, which is slower than any Gaussian rate). Cauchy
distributions (which have density functions of the form
$f(x) = \frac{1}{\pi} \frac{\gamma}{(x - x_0)^2 + \gamma^2}$) are typical
examples of heavy-tailed distributions which are almost surely finite,
but do not have all moments finite (indeed, the Cauchy distribution
does not even have finite first moment).

If we have a family of scalar random variables $X_\alpha$ depending on
a parameter $\alpha$, we say that the $X_\alpha$ are uniformly surely bounded
(resp. uniformly almost surely bounded, uniformly subgaussian, have
uniform sub-exponential tails, or uniformly bounded $k^{\mathrm{th}}$ moment)
if the relevant parameters $M, C, c, a$ in the above definitions can be
chosen to be independent of $\alpha$.

Fix $k \geq 1$. If $X$ has finite $k^{\mathrm{th}}$ moment, say $\mathbf{E}|X|^k \leq C$, then from
Markov's inequality (1.14) one has

(1.23) $\mathbf{P}(|X| \geq \lambda) \leq C \lambda^{-k},$

thus we see that the higher the moments that we control, the faster
the tail decay is. From the dominated convergence theorem we also
have the variant

(1.24) $\lim_{\lambda \to \infty} \lambda^k \mathbf{P}(|X| \geq \lambda) = 0.$

However, this result is qualitative or ineffective rather than
quantitative because it provides no rate of convergence of $\lambda^k \mathbf{P}(|X| \geq \lambda)$
to zero. Indeed, it is easy to construct a family $X_\alpha$ of random
variables of uniformly bounded $k^{\mathrm{th}}$ moment, but for which the quantities
$\lambda^k \mathbf{P}(|X_\alpha| \geq \lambda)$ do not converge uniformly to zero (e.g. take $X_m$ to be
$m$ times the indicator of an event of probability $m^{-k}$ for $m = 1, 2, \dots$).
Because of this issue, we will often have to strengthen the property
of having a uniformly bounded moment, to that of obtaining a
uniformly quantitative control on the decay in (1.24) for a family $X_\alpha$ of
random variables; we will see examples of this in later lectures.
However, this technicality does not arise in the important model case of
identically distributed random variables, since in this case we trivially
have uniformity in the decay rate of (1.24).
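
The counterexample mentioned above (with $X_m$ equal to $m$ on an event of probability $m^{-k}$ and zero elsewhere) can be tabulated exactly. The sketch below is one way to do so; the function names and the choice $k = 2$ are assumptions made for illustration.

```python
from fractions import Fraction

def kth_moment(m, k):
    """E|X_m|^k, where X_m = m on an event of probability m**-k and 0 elsewhere."""
    return Fraction(m) ** k * Fraction(1, m ** k)

def scaled_tail(m, k, lam):
    """lam**k * P(|X_m| >= lam), for lam > 0."""
    # For 0 < lam <= m the event {|X_m| >= lam} is exactly {X_m = m}.
    prob = Fraction(1, m ** k) if lam <= m else Fraction(0)
    return Fraction(lam) ** k * prob

k = 2
ms = range(1, 8)

moments = [kth_moment(m, k) for m in ms]            # uniformly bounded in m
at_own_scale = [scaled_tail(m, k, m) for m in ms]   # no decay along lam = m
fixed_m_decay = [scaled_tail(3, k, lam) for lam in range(1, 7)]

print(moments, at_own_scale, fixed_m_decay)
```

Every $k^{\mathrm{th}}$ moment here equals 1, and for each fixed $m$ the quantity $\lambda^k \mathbf{P}(|X_m| \geq \lambda)$ vanishes once $\lambda > m$, in agreement with (1.24); but along the diagonal $\lambda = m$ it stays equal to 1, so the convergence is not uniform in $m$.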

We observe some consequences of (1.23) and the preceding defi- 
nitions: 

Lemma 1.1.10. Let $X = X_n$ be a scalar random variable depending
on a parameter $n$.

(i) If $|X_n|$ has uniformly bounded expectation, then for any
$\varepsilon > 0$ independent of $n$, we have $|X_n| = O(n^\varepsilon)$ with high
probability.

(ii) If $X_n$ has uniformly bounded $k^{\mathrm{th}}$ moment, then for any $A > 0$,
we have $|X_n| = O(n^{A/k})$ with probability $1 - O(n^{-A})$.

(iii) If $X_n$ has uniform sub-exponential tails, then we have
$|X_n| = O(\log^{O(1)} n)$ with overwhelming probability.

Exercise 1.1.4. Show that a real-valued random variable $X$ is
subgaussian if and only if there exists $C > 0$ such that $\mathbf{E}e^{tX} \leq C \exp(Ct^2)$
for all real $t$, and if and only if there exists $C > 0$ such that
$\mathbf{E}|X|^k \leq (Ck)^{k/2}$ for all $k \geq 1$.

Exercise 1.1.5. Show that a real-valued random variable $X$ has
sub-exponential tails if and only if there exists $C > 0$ such that
$\mathbf{E}|X|^k \leq \exp(Ck^C)$ for all positive integers $k$.

Once the second moment of a scalar random variable is finite, one
can define the variance

(1.25) $\mathbf{Var}(X) := \mathbf{E}|X - \mathbf{E}(X)|^2.$

From Markov's inequality we thus have Chebyshev's inequality

(1.26) $\mathbf{P}(|X - \mathbf{E}(X)| \geq \lambda) \leq \frac{\mathbf{Var}(X)}{\lambda^2}.$

Upper bounds on $\mathbf{P}(|X - \mathbf{E}(X)| \geq \lambda)$ for $\lambda$ large are known as large
deviation inequalities. Chebyshev's inequality (1.26) gives a simple
but still useful large deviation inequality, which becomes non-trivial once
$\lambda$ exceeds the standard deviation $\mathbf{Var}(X)^{1/2}$ of the random variable.
The use of Chebyshev's inequality, combined with a computation of
variances, is known as the second moment method.
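
To see Chebyshev's inequality (1.26) in action, one can again work exactly with a small distribution. The sketch below uses a fair six-sided die as a hypothetical example, computes its variance from (1.25), and compares the true deviation probabilities with the Chebyshev bound at a few thresholds chosen for illustration.

```python
from fractions import Fraction

# Toy random variable: the value shown by a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(dist, f=lambda x: x):
    """E f(X) for a finitely supported distribution {value: probability}."""
    return sum(f(x) * p for x, p in dist.items())

mean = expectation(die)
variance = expectation(die, lambda x: (x - mean) ** 2)   # definition (1.25)

def deviation_prob(lam):
    """P(|X - E X| >= lam)."""
    return sum(p for x, p in die.items() if abs(x - mean) >= lam)

thresholds = [Fraction(1, 2), Fraction(3, 2), Fraction(5, 2), Fraction(3)]

# For each threshold: (actual deviation probability, Chebyshev bound).
bounds = {lam: (deviation_prob(lam), variance / lam ** 2) for lam in thresholds}

for lam, (actual, bound) in bounds.items():
    print(lam, actual, bound)
```

In this example the bound exceeds 1 (and so carries no information) at the two thresholds below the standard deviation, and becomes a genuine constraint only at the larger ones.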

Exercise 1.1.6 (Scaling of mean and variance). If $X$ is a scalar
random variable of finite mean and variance, and $a, b$ are scalars,
show that $\mathbf{E}(a + bX) = a + b\mathbf{E}(X)$ and $\mathbf{Var}(a + bX) = |b|^2 \mathbf{Var}(X)$.
In particular, if $X$ has non-zero variance, then there exist scalars $a, b$
such that $a + bX$ has mean zero and variance one.

Exercise 1.1.7. We say that a real number $\mathbf{M}(X)$ is a median of a
real-valued random variable $X$ if
$\mathbf{P}(X > \mathbf{M}(X)), \mathbf{P}(X < \mathbf{M}(X)) \leq 1/2$.

(i) Show that a median always exists, and if $X$ is absolutely
continuous with strictly positive density function, then the
median is unique.

(ii) If $X$ has finite second moment, show that
$\mathbf{M}(X) = \mathbf{E}(X) + O(\mathbf{Var}(X)^{1/2})$ for any median $\mathbf{M}(X)$.

Exercise 1.1.8 (Jensen's inequality). Let $F : \mathbf{R} \to \mathbf{R}$ be a convex
function (thus $F((1 - t)x + ty) \leq (1 - t)F(x) + tF(y)$ for all $x, y \in \mathbf{R}$
and $0 \leq t \leq 1$), and let $X$ be a bounded real-valued random variable.
Show that $\mathbf{E}F(X) \geq F(\mathbf{E}X)$. (Hint: Bound $F$ from below using a
tangent line at $\mathbf{E}X$.) Extend this inequality to the case when $X$ takes
values in $\mathbf{R}^n$ (and $F$ has $\mathbf{R}^n$ as its domain).

Exercise 1.1.9 (Paley-Zygmund inequality). Let $X$ be a positive
random variable with finite variance. Show that

$\mathbf{P}(X \geq \lambda \mathbf{E}(X)) \geq (1 - \lambda)^2 \frac{(\mathbf{E}X)^2}{\mathbf{E}X^2}$

for any $0 < \lambda < 1$.

If $X$ is subgaussian (or has sub-exponential tails with exponent
$a > 1$), then from dominated convergence we have the Taylor expansion

(1.27) $\mathbf{E}e^{tX} = 1 + \sum_{k=1}^\infty \frac{t^k}{k!} \mathbf{E}X^k$

for any real or complex $t$, thus relating the exponential and Fourier
moments with the $k^{\mathrm{th}}$ moments.

1.1.3. Independence. When studying the behaviour of a single
random variable $X$, the distribution $\mu_X$ captures all the probabilistic
information one wants to know about $X$. The following exercise is
one way of making this statement rigorous:

Exercise 1.1.10. Let $X, X'$ be random variables (on sample spaces
$\Omega, \Omega'$ respectively) taking values in a range $R$, such that $X \equiv X'$.
Show that after extending the spaces $\Omega, \Omega'$, the two random variables
$X, X'$ are isomorphic, in the sense that there exists a probability
space isomorphism $\pi : \Omega \to \Omega'$ (i.e. an invertible extension map
whose inverse is also an extension map) such that $X = X' \circ \pi$.

However, once one studies families $(X_\alpha)_{\alpha \in A}$ of random variables
$X_\alpha$ taking values in measurable spaces $R_\alpha$ (on a single sample space
$\Omega$), the distributions of the individual variables $X_\alpha$ are no longer
sufficient to describe all the probabilistic statistics of interest; the
joint distribution of the variables (i.e. the distribution of the tuple
$(X_\alpha)_{\alpha \in A}$, which can be viewed as a single random variable taking
values in the product measurable space $\prod_{\alpha \in A} R_\alpha$) also becomes relevant.

Example 1.1.11. Let $(X_1, X_2)$ be drawn uniformly at random from
the set $\{(-1, -1), (-1, +1), (+1, -1), (+1, +1)\}$. Then the random
variables $X_1$, $X_2$, and $-X_1$ all individually have the same
distribution, namely the signed Bernoulli distribution. However the pairs
$(X_1, X_2)$, $(X_1, X_1)$, and $(X_1, -X_1)$ all have different joint
distributions: the first pair, by definition, is uniformly distributed in the set

$\{(-1, -1), (-1, +1), (+1, -1), (+1, +1)\},$

while the second pair is uniformly distributed in $\{(-1, -1), (+1, +1)\}$,
and the third pair is uniformly distributed in $\{(-1, +1), (+1, -1)\}$.
Thus, for instance, if one is told that $X, Y$ are two random variables
with the Bernoulli distribution, and asked to compute the probability
that $X = Y$, there is insufficient information to solve the problem;
if $(X, Y)$ were distributed as $(X_1, X_2)$, then the probability would
be $1/2$, while if $(X, Y)$ were distributed as $(X_1, X_1)$, the probability
would be $1$, and if $(X, Y)$ were distributed as $(X_1, -X_1)$, the
probability would be $0$. Thus one sees that one needs the joint distribution,
and not just the individual distributions, to obtain a unique answer
to the question.
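
The computations in Example 1.1.11 are small enough to enumerate directly. The following sketch does so in Python with exact rational arithmetic; the helper names `law` and `prob` are introduced here for illustration only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of Example 1.1.11: (X1, X2) uniform on the four sign patterns.
outcomes = list(product([-1, +1], repeat=2))
weight = Fraction(1, len(outcomes))

def law(variable):
    """Distribution of variable(w) when w is drawn uniformly from outcomes."""
    dist = Counter()
    for w in outcomes:
        dist[variable(w)] += weight
    return dict(dist)

def prob(event):
    """Probability of the set of outcomes w satisfying event(w)."""
    return sum(weight for w in outcomes if event(w))

pairs = {
    "(X1, X2)": lambda w: (w[0], w[1]),
    "(X1, X1)": lambda w: (w[0], w[0]),
    "(X1, -X1)": lambda w: (w[0], -w[0]),
}

# Individual laws of X1, X2 and -X1: all signed Bernoulli.
marginals = [law(lambda w: w[0]), law(lambda w: w[1]), law(lambda w: -w[0])]

# Joint laws of the three pairs, and the probability that the two
# components of each pair agree.
joint_laws = {name: law(pair) for name, pair in pairs.items()}
equal_probs = {
    name: prob(lambda w, pair=pair: pair(w)[0] == pair(w)[1])
    for name, pair in pairs.items()
}

print(marginals)
print(joint_laws)
print(equal_probs)
```

The three marginal laws coincide, while the three joint laws (and hence the three values of the probability that the components agree) differ, exactly as described in the example.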

There is however an important special class of families of ran- 
dom variables in which the joint distribution is determined by the 
individual distributions. 

Definition 1.1.12 (Joint independence). A family $(X_\alpha)_{\alpha \in A}$ of
random variables (which may be finite, countably infinite, or
uncountably infinite) is said to be jointly independent if the distribution of
$(X_\alpha)_{\alpha \in A}$ is the product measure of the distribution of the individual
$X_\alpha$.

A family $(X_\alpha)_{\alpha \in A}$ is said to be pairwise independent if the pairs
$(X_\alpha, X_\beta)$ are jointly independent for all distinct $\alpha, \beta \in A$. More
generally, $(X_\alpha)_{\alpha \in A}$ is said to be $k$-wise independent if
$(X_{\alpha_1}, \dots, X_{\alpha_{k'}})$ are jointly independent for all $1 \leq k' \leq k$ and all
distinct $\alpha_1, \dots, \alpha_{k'} \in A$.

We also say that $X$ is independent of $Y$ if $(X, Y)$ are jointly
independent.

A family of events $(E_\alpha)_{\alpha \in A}$ is said to be jointly independent if
their indicators $(\mathbf{I}(E_\alpha))_{\alpha \in A}$ are jointly independent. Similarly for
pairwise independence and $k$-wise independence.

From the theory of product measure, we have the following equiv- 
alent formulation of joint independence: 

Exercise 1.1.11. Let $(X_\alpha)_{\alpha \in A}$ be a family of random variables, with
each $X_\alpha$ taking values in a measurable space $R_\alpha$.

(i) Show that the $(X_\alpha)_{\alpha \in A}$ are jointly independent if and only if
for every collection of distinct elements $\alpha_1, \dots, \alpha_{k'}$ of $A$, and
all measurable subsets $E_i \subset R_{\alpha_i}$ for $1 \leq i \leq k'$, one has

$\mathbf{P}(X_{\alpha_i} \in E_i \text{ for all } 1 \leq i \leq k') = \prod_{i=1}^{k'} \mathbf{P}(X_{\alpha_i} \in E_i).$

(ii) Show that the necessary and sufficient condition for $(X_\alpha)_{\alpha \in A}$
being $k$-wise independent is the same, except that $k'$ is
constrained to be at most $k$.

In particular, a finite family $(X_1, \dots, X_k)$ of random variables $X_i$,
$1 \leq i \leq k$ taking values in measurable spaces $R_i$ are jointly independent
if and only if

$\mathbf{P}(X_i \in E_i \text{ for all } 1 \leq i \leq k) = \prod_{i=1}^k \mathbf{P}(X_i \in E_i)$

for all measurable $E_i \subset R_i$.

If the $X_\alpha$ are discrete random variables, one can take the $E_i$ to
be singleton sets in the above discussion.

From the above exercise we see that joint independence implies
$k$-wise independence for any $k$, and that joint independence is preserved
under permuting, relabeling, or eliminating some or all of the $X_\alpha$. A
single random variable is automatically jointly independent, and so
1-wise independence is vacuously true; pairwise independence is the
first nontrivial notion of independence in this hierarchy.

Example 1.1.13. Let $\mathbf{F}_2$ be the field of two elements, let $V \subset \mathbf{F}_2^3$
be the subspace of triples $(x_1, x_2, x_3) \in \mathbf{F}_2^3$ with $x_1 + x_2 + x_3 = 0$,
and let $(X_1, X_2, X_3)$ be drawn uniformly at random from $V$. Then
$(X_1, X_2, X_3)$ are pairwise independent, but not jointly independent.
In particular, $X_3$ is independent of each of $X_1, X_2$ separately, but is
not independent of $(X_1, X_2)$.
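
Example 1.1.13 can be verified by brute force, since $V$ has only four elements. The sketch below represents $\mathbf{F}_2$ by the integers 0 and 1 with addition modulo 2 and tests the product rule on singleton events; the function `independent` is a hypothetical helper written for this illustration, and it checks the rule only for the tuple of coordinates it is given.

```python
from fractions import Fraction
from itertools import product

# V = {x in F_2^3 : x1 + x2 + x3 = 0}, with F_2 modelled as {0, 1} mod 2.
V = [x for x in product([0, 1], repeat=3) if sum(x) % 2 == 0]

def prob(event):
    """Probability of event(x) when x is drawn uniformly from V."""
    return Fraction(sum(1 for x in V if event(x)), len(V))

def independent(indices):
    """Check P(X_i = v_i for all i) = prod_i P(X_i = v_i) for the given
    coordinates and every choice of values (v_i)."""
    for values in product([0, 1], repeat=len(indices)):
        joint = prob(lambda x: all(x[i] == v for i, v in zip(indices, values)))
        marginals = Fraction(1)
        for i, v in zip(indices, values):
            marginals *= prob(lambda x, i=i, v=v: x[i] == v)
        if joint != marginals:
            return False
    return True

pairwise = [independent(pair) for pair in [(0, 1), (0, 2), (1, 2)]]
triple = independent((0, 1, 2))

print(V, pairwise, triple)
```

Each pair of coordinates passes the test, while the full triple fails it (for instance, the point $(0,0,0)$ has probability $1/4$ rather than $1/8$), matching the claim of the example.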

Exercise 1.1.12. This exercise generalises the above example. Let
$\mathbf{F}$ be a finite field, and let $V$ be a subspace of $\mathbf{F}^n$ for some finite $n$.
Let $(X_1, \dots, X_n)$ be drawn uniformly at random from $V$. Suppose
that $V$ is not contained in any coordinate hyperplane in $\mathbf{F}^n$.

(i) Show that each $X_i$, $1 \leq i \leq n$ is uniformly distributed in $\mathbf{F}$.

(ii) Show that for any $k \geq 2$, $(X_1, \dots, X_n)$ is $k$-wise
independent if and only if $V$ is not contained in any hyperplane
which is definable using at most $k$ of the coordinate
variables.

(iii) Show that $(X_1, \dots, X_n)$ is jointly independent if and only if
$V = \mathbf{F}^n$.

Informally, we thus see that imposing constraints between $k$ variables
at a time can destroy $k$-wise independence, while leaving lower-order
independence unaffected.

Exercise 1.1.13. Let $V \subset \mathbf{F}_2^3$ be the subspace of triples $(x_1, x_2, x_3) \in
\mathbf{F}_2^3$ with $x_1 + x_2 = 0$, and let $(X_1, X_2, X_3)$ be drawn uniformly at
random from $V$. Then $X_3$ is independent of $(X_1, X_2)$ (and in particular,
is independent of $X_1$ and $X_2$ separately), but $X_1, X_2$ are not
independent of each other.

Exercise 1.1.14. We say that one random variable $Y$ (with values
in $R_Y$) is determined by another random variable $X$ (with values in
$R_X$) if there exists a (deterministic) function $f : R_X \to R_Y$ such that
$Y = f(X)$ is surely true (i.e. $Y(\omega) = f(X(\omega))$ for all $\omega \in \Omega$). Show
that if $(X_\alpha)_{\alpha \in A}$ is a family of jointly independent random variables,
and $(Y_\beta)_{\beta \in B}$ is a family such that each $Y_\beta$ is determined by some
subfamily $(X_\alpha)_{\alpha \in A_\beta}$ of the $(X_\alpha)_{\alpha \in A}$, with the $A_\beta$ disjoint as $\beta$ varies,
then the $(Y_\beta)_{\beta \in B}$ are jointly independent also.

Exercise 1.1.15 (Determinism vs. independence). Let X, Y be ran- 
dom variables. Show that Y is deterministic if and only if it is simul- 
taneously determined by X, and independent of X. 

Exercise 1.1.16. Show that a complex random variable $X$ is a
complex Gaussian random variable (i.e. its distribution is a
complex normal distribution) if and only if its real and imaginary parts
$\mathrm{Re}(X), \mathrm{Im}(X)$ are independent real Gaussian random variables with
the same variance. In particular, the variance of $\mathrm{Re}(X)$ and $\mathrm{Im}(X)$
will be half of the variance of $X$.

One key advantage of working with jointly independent random 
variables and events is that one can compute various probabilistic 
quantities quite easily. We give some key examples below. 

Exercise 1.1.17. If $E_1, \dots, E_k$ are jointly independent events, show
that

(1.28) $\mathbf{P}(\bigwedge_{i=1}^k E_i) = \prod_{i=1}^k \mathbf{P}(E_i)$

and

(1.29) $\mathbf{P}(\bigvee_{i=1}^k E_i) = 1 - \prod_{i=1}^k (1 - \mathbf{P}(E_i)).$

Show that the converse statement (i.e. that (1.28) and (1.29) imply
joint independence) is true for $k = 2$, but fails for higher $k$. Can one
find a correct replacement for this converse for higher $k$?

Exercise 1.1.18.

(i) If $X_1, \dots, X_k$ are jointly independent random variables
taking values in $[0, +\infty]$, show that

$\mathbf{E} \prod_{i=1}^k X_i = \prod_{i=1}^k \mathbf{E}X_i.$

(ii) If $X_1, \dots, X_k$ are jointly independent absolutely integrable
scalar random variables taking values in $[0, +\infty]$, show that
$\prod_{i=1}^k X_i$ is absolutely integrable, and

$\mathbf{E} \prod_{i=1}^k X_i = \prod_{i=1}^k \mathbf{E}X_i.$

Remark 1.1.14. The above exercise combines well with Exercise
1.1.14. For instance, if $X_1, \dots, X_k$ are jointly independent
subgaussian variables, then from Exercises 1.1.14, 1.1.18 we see that

(1.30) $\mathbf{E} \prod_{i=1}^k e^{tX_i} = \prod_{i=1}^k \mathbf{E}e^{tX_i}$

for any complex $t$. This identity is a key component of the exponential
moment method, which we will discuss in Section 2.1.
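
As a quick numerical illustration of the identity (1.30), the sketch below takes $X_1, \dots, X_k$ to be independent signed Bernoulli variables (an assumption made for this example, along with the restriction to real $t$) and compares a brute-force evaluation of the left-hand side with the product on the right.

```python
import math
from itertools import product

def mgf_of_sum(t, k):
    """E exp(t (X_1 + ... + X_k)) for independent signed Bernoulli X_i,
    computed by enumerating all 2**k equally likely sign patterns."""
    outcomes = list(product([-1, +1], repeat=k))
    return sum(math.exp(t * sum(w)) for w in outcomes) / len(outcomes)

def product_of_mgfs(t, k):
    """Product over i of E exp(t X_i); each factor equals cosh(t)."""
    single = (math.exp(t) + math.exp(-t)) / 2
    return single ** k

t, k = 0.7, 3
lhs = mgf_of_sum(t, k)
rhs = product_of_mgfs(t, k)
print(lhs, rhs)
```

The two values agree up to floating-point rounding. For dependent variables such as $X_1 = X_2 = X_3$ the left-hand side would instead be $\cosh(3t)$, which differs from $\cosh(t)^3$ for $t \neq 0$.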

The following result is a key component of the second moment 
method. 

Exercise 1.1.19 (Pairwise independence implies linearity of
variance). If $X_1, \dots, X_k$ are pairwise independent scalar random
variables of finite mean and variance, show that

$\mathbf{Var}(\sum_{i=1}^k X_i) = \sum_{i=1}^k \mathbf{Var}(X_i)$

and more generally

$\mathbf{Var}(\sum_{i=1}^k c_i X_i) = \sum_{i=1}^k |c_i|^2 \mathbf{Var}(X_i)$

for any scalars $c_i$ (compare with (1.12), (1.22)).
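
The point of Exercise 1.1.19 is that pairwise (rather than joint) independence already suffices. One way to see this concretely is to reuse the pairwise independent but not jointly independent triple of Example 1.1.13, treating its $\{0, 1\}$-valued coordinates as integers; this reinterpretation, and the helper names, are assumptions of the sketch below.

```python
from fractions import Fraction
from itertools import product

# (X1, X2, X3) uniform on the triples in {0, 1}^3 with even coordinate sum.
V = [x for x in product([0, 1], repeat=3) if sum(x) % 2 == 0]

def expect(f):
    """E f(X1, X2, X3) under the uniform distribution on V."""
    return sum(Fraction(f(x)) for x in V) / len(V)

def var(f):
    """Variance of f(X1, X2, X3), following definition (1.25)."""
    m = expect(f)
    return expect(lambda x: (f(x) - m) ** 2)

# Pairwise independent coordinates: variance of the sum is additive.
var_of_sum = var(lambda x: x[0] + x[1] + x[2])
sum_of_vars = sum(var(lambda x, i=i: x[i]) for i in range(3))

# A dependent pair (X1, X1): additivity fails.
var_dependent_sum = var(lambda x: x[0] + x[0])
naive_sum = 2 * var(lambda x: x[0])

print(var_of_sum, sum_of_vars, var_dependent_sum, naive_sum)
```

The first two values coincide even though the triple is not jointly independent, while for the fully dependent pair the variance of the sum is twice the sum of the variances.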

The product measure construction allows us to extend Lemma 
1.1.7: 

Exercise 1.1.20 (Creation of new, independent random variables).
Let $(X_\alpha)_{\alpha \in A}$ be a family of random variables (not necessarily
independent or finite), and let $(\mu_\beta)_{\beta \in B}$ be a collection (not necessarily
finite) of probability measures $\mu_\beta$ on measurable spaces $R_\beta$. Then,
after extending the sample space if necessary, one can find a
family $(Y_\beta)_{\beta \in B}$ of independent random variables, such that each $Y_\beta$ has
distribution $\mu_\beta$, and the two families $(X_\alpha)_{\alpha \in A}$ and $(Y_\beta)_{\beta \in B}$ are
independent of each other.

We isolate the important case when $\mu_\beta = \mu$ is independent of
$\beta$. We say that a family $(X_\alpha)_{\alpha \in A}$ of random variables is
independently and identically distributed, or iid for short, if they are jointly
independent and all the $X_\alpha$ have the same distribution.

Corollary 1.1.15. Let $(X_\alpha)_{\alpha \in A}$ be a family of random variables (not
necessarily independent or finite), let $\mu$ be a probability measure on
a measurable space $R$, and let $B$ be an arbitrary set. Then, after
extending the sample space if necessary, one can find an iid family
$(Y_\beta)_{\beta \in B}$ with distribution $\mu$ which is independent of $(X_\alpha)_{\alpha \in A}$.

Thus, for instance, one can create arbitrarily large iid families 
of Bernoulli random variables, Gaussian random variables, etc., re- 
gardless of what other random variables are already in play. We thus 
see that the freedom to extend the underlying sample space allows 
us access to an unlimited source of randomness. This is in contrast 
to a situation studied in complexity theory and computer science, in 
which one does not assume that the sample space can be extended at 
will, and the amount of randomness one can use is therefore limited.

Remark 1.1.16. Given two probability measures $\mu_X, \mu_Y$ on two
measurable spaces $R_X, R_Y$, a joining or coupling of these
measures is a random variable $(X, Y)$ taking values in the product space
$R_X \times R_Y$, whose individual components $X, Y$ have distribution $\mu_X, \mu_Y$
respectively. Exercise 1.1.20 shows that one can always couple two
distributions together in an independent manner; but one can
certainly create non-independent couplings as well. The study of
couplings (or joinings) is particularly important in ergodic theory, but
this will not be the focus of this text.

1.1.4. Conditioning. Random variables are inherently non-deterministic
in nature, and as such one has to be careful when applying
deterministic laws of reasoning to such variables. For instance, consider the
law of the excluded middle: a statement $P$ is either true or false, but
not both. If this statement is a random variable, rather than
deterministic, then instead it is true with some probability $p$ and false with
some complementary probability $1 - p$. Also, applying set-theoretic
constructions with random inputs can lead to sets, spaces, and other
structures which are themselves random variables, which can be quite
confusing and require a certain amount of technical care; consider, for
instance, the task of rigorously defining a Euclidean space $\mathbf{R}^d$ when
the dimension $d$ is itself a random variable.

Now, one can always eliminate these difficulties by explicitly
working with points $\omega$ in the underlying sample space $\Omega$, and
replacing every random variable $X$ by its evaluation $X(\omega)$ at that point;
this removes all the randomness from consideration, making
everything deterministic (for fixed $\omega$). This approach is rigorous, but goes
against the "probabilistic way of thinking", as one now needs to take
some care in extending the sample space.

However, if instead one only seeks to remove a partial amount 
of randomness from consideration, then one can do this in a manner 
consistent with the probabilistic way of thinking, by introducing the 
machinery of conditioning. By conditioning an event to be true or 
false, or conditioning a random variable to be fixed, one can turn that 
random event or variable into a deterministic one, while preserving the 
random nature of other events and variables (particularly those which 
are independent of the event or variable being conditioned upon).

We begin by considering the simpler situation of conditioning on
an event.

Definition 1.1.17 (Conditioning on an event). Let $E$ be an event
(or statement) which holds with positive probability $\mathbf{P}(E)$. By
conditioning on the event $E$, we mean the act of replacing the underlying
sample space $\Omega$ with the subset of $\Omega$ where $E$ holds, and replacing
the underlying probability measure $\mathbf{P}$ by the conditional probability
measure $\mathbf{P}(|E)$, defined by the formula

(1.31) $\mathbf{P}(F|E) := \mathbf{P}(F \wedge E) / \mathbf{P}(E).$

All events $F$ on the original sample space can thus be viewed as events
$(F|E)$ on the conditioned space, which we model set-theoretically as
the set of all $\omega$ in $E$ obeying $F$. Note that this notation is compatible
with (1.31).

All random variables $X$ on the original sample space can also be
viewed as random variables $X$ on the conditioned space, by
restriction. We will refer to this conditioned random variable as $(X|E)$, and
thus define conditional distribution $\mu_{(X|E)}$ and conditional
expectation $\mathbf{E}(X|E)$ (if $X$ is scalar) accordingly.

One can also condition on the complementary event $\overline{E}$, provided
that this event holds with positive probability also.

By undoing this conditioning, we revert the underlying sample
space and measure back to their original (or unconditional) values.
Note that any random variable which has been defined both after
conditioning on $E$, and conditioning on $\overline{E}$, can still be viewed as a
combined random variable after undoing the conditioning.

Conditioning affects the underlying probability space in a manner
which is different from extension, and so the act of conditioning is not
guaranteed to preserve probabilistic concepts such as distribution,
probability, or expectation. Nevertheless, the conditioned versions of
these concepts are closely related to their unconditional counterparts:

Exercise 1.1.21. If $E$ and $\overline{E}$ both occur with positive probability,
establish the identities

(1.32) $\mathbf{P}(F) = \mathbf{P}(F|E)\mathbf{P}(E) + \mathbf{P}(F|\overline{E})\mathbf{P}(\overline{E})$

for any (unconditional) event $F$ and

(1.33) $\mu_X = \mu_{(X|E)}\mathbf{P}(E) + \mu_{(X|\overline{E})}\mathbf{P}(\overline{E})$

for any (unconditional) random variable $X$ (in the original sample
space). In a similar spirit, if $X$ is a non-negative or absolutely
integrable scalar (unconditional) random variable, show that $(X|E)$,
$(X|\overline{E})$ are also non-negative and absolutely integrable on their
respective conditioned spaces, and that

(1.34) $\mathbf{E}X = \mathbf{E}(X|E)\mathbf{P}(E) + \mathbf{E}(X|\overline{E})\mathbf{P}(\overline{E}).$

In the degenerate case when $E$ occurs with full probability,
conditioning to the complementary event $\overline{E}$ is not well defined, but show that
in those cases we can still obtain the above formulae if we adopt the
convention that any term involving the vanishing factor $\mathbf{P}(\overline{E})$ should
be omitted. Similarly if $E$ occurs with zero probability.

The above identities allow one to study probabilities, distribu- 
tions, and expectations on the original sample space by conditioning 
to the two conditioned spaces. 
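
The decompositions of probability and expectation over an event and its complement can be checked on a small finite sample space. In the sketch below the sample space is a single fair die, and the events "the die is even" and "the die shows at least 4" are arbitrary choices made for illustration; the conditional quantities are computed directly from the ratio definition of conditional probability.

```python
from fractions import Fraction

omega = range(1, 7)                       # outcomes of one fair die
P = {w: Fraction(1, 6) for w in omega}

def prob(event, given=lambda w: True):
    """P(event | given); `given` must have positive probability."""
    mass = sum(P[w] for w in omega if given(w))
    return sum(P[w] for w in omega if given(w) and event(w)) / mass

def expect(X, given=lambda w: True):
    """E(X | given); `given` must have positive probability."""
    mass = sum(P[w] for w in omega if given(w))
    return sum(X(w) * P[w] for w in omega if given(w)) / mass

E = lambda w: w % 2 == 0          # conditioning event: the die is even
not_E = lambda w: not E(w)        # its complement
F = lambda w: w >= 4              # another event
X = lambda w: w                   # the value of the die

# Weighted combinations of the conditional quantities over E and its complement.
total_prob = prob(F, E) * prob(E) + prob(F, not_E) * prob(not_E)
total_mean = expect(X, E) * prob(E) + expect(X, not_E) * prob(not_E)

print(prob(F), total_prob, expect(X), total_mean)
```

In both cases the weighted combination of the two conditional quantities recovers the unconditional one.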

From (1.32) we obtain the inequality

(1.35) $\mathbf{P}(F|E) \leq \mathbf{P}(F) / \mathbf{P}(E),$

thus conditioning can magnify probabilities by a factor of at most
$1/\mathbf{P}(E)$. In particular,

(i) If $F$ occurs unconditionally surely, it occurs surely
conditioning on $E$ also.

(ii) If $F$ occurs unconditionally almost surely, it occurs almost
surely conditioning on $E$ also.

(iii) If $F$ occurs unconditionally with overwhelming probability,
it occurs with overwhelming probability conditioning on $E$
also, provided that $\mathbf{P}(E) \geq c n^{-C}$ for some $c, C > 0$
independent of $n$.

(iv) If $F$ occurs unconditionally with high probability, it occurs
with high probability conditioning on $E$ also, provided that
$\mathbf{P}(E) \geq c n^{-a}$ for some $c > 0$ and some sufficiently small
$a > 0$ independent of $n$.

(v) If $F$ occurs unconditionally asymptotically almost surely, it
occurs asymptotically almost surely conditioning on $E$ also,
provided that $\mathbf{P}(E) \geq c$ for some $c > 0$ independent of $n$.

Conditioning can distort the probability of events and the
distribution of random variables. Most obviously, conditioning on $E$
elevates the probability of $E$ to $1$, and sends the probability of the
complementary event $\overline{E}$ to zero. In a similar spirit, if $X$ is a random
variable uniformly distributed on some finite set $S$, and $S'$ is a
non-empty subset of $S$, then conditioning to the event $X \in S'$ alters the
distribution of $X$ to now become the uniform distribution on $S'$ rather
than $S$ (and conditioning to the complementary event produces the
uniform distribution on $S \backslash S'$).

However, events and random variables that are independent of the
event $E$ being conditioned upon are essentially unaffected by
conditioning. Indeed, if $F$ is an event independent of $E$, then $(F|E)$ occurs
with the same probability as $F$; and if $X$ is a random variable
independent of $E$ (or equivalently, independent of the indicator $\mathbf{I}(E)$),
then $(X|E)$ has the same distribution as $X$.

Remark 1.1.18. One can view conditioning to an event $E$ and its
complement $\overline{E}$ as the probabilistic analogue of the law of the excluded
middle. In deterministic logic, given a statement $P$, one can divide
into two separate cases, depending on whether $P$ is true or false;
and any other statement $Q$ is unconditionally true if and only if it is
conditionally true in both of these two cases. Similarly, in probability
theory, given an event $E$, one can condition into two separate sample
spaces, depending on whether $E$ is conditioned to be true or false; and
the unconditional statistics of any random variable or event are then
a weighted average of the conditional statistics on the two sample
spaces, where the weights are given by the probability of $E$ and its
complement.

Now we consider conditioning with respect to a discrete random
variable $Y$, taking values in some range $R$. One can condition on any
event $Y = y$, $y \in R$ which occurs with positive probability. It is then
not difficult to establish the analogous identities to those in Exercise
1.1.21:






Exercise 1.1.22. Let $Y$ be a discrete random variable with range $R$.
Then we have

(1.36) $\mathbf{P}(F) = \sum_{y \in R} \mathbf{P}(F|Y = y) \mathbf{P}(Y = y)$

for any (unconditional) event $F$, and

(1.37) $\mu_X = \sum_{y \in R} \mu_{(X|Y=y)} \mathbf{P}(Y = y)$

for any (unconditional) random variable $X$ (where the sum of
non-negative measures is defined in the obvious manner), and for
absolutely integrable or non-negative (unconditional) random variables $X$,
one has

(1.38) $\mathbf{E}X = \sum_{y \in R} \mathbf{E}(X|Y = y) \mathbf{P}(Y = y).$

In all of these identities, we adopt the convention that any term
involving $\mathbf{P}(Y = y)$ is ignored when $\mathbf{P}(Y = y) = 0$.

With the notation as in the above exercise, we define⁴ the conditional
probability P(F|Y) of an (unconditional) event F conditioning
on Y to be the (unconditional) random variable that is defined to
equal P(F|Y = y) whenever Y = y, and similarly, for any absolutely
integrable or non-negative (unconditional) random variable X, we
define the conditional expectation E(X|Y) to be the (unconditional)
random variable that is defined to equal E(X|Y = y) whenever Y = y.
Thus (1.36), (1.38) simplify to

(1.39) $P(F) = E(P(F|Y))$
and

(1.40) $E(X) = E(E(X|Y)).$

From (1.12) we have the linearity of conditional expectation 

(1.41) $E(c_1 X_1 + \dots + c_k X_k | Y) = c_1 E(X_1|Y) + \dots + c_k E(X_k|Y),$
where the identity is understood to hold almost surely. 



⁴ Strictly speaking, since we are not defining conditional expectation when
P(Y = y) = 0, these random variables are only defined almost surely, rather
than surely, but this will not cause difficulties in practice; see Remark 1.1.5.

Remark 1.1.19. One can interpret conditional expectation as a type
of orthogonal projection; see for instance [Ta2009, §2.8]. But we will 
not use this perspective in this course. Just as conditioning on an 
event and its complement can be viewed as the probabilistic analogue 
of the law of the excluded middle, conditioning on a discrete random 
variable can be viewed as the probabilistic analogue of dividing into 
finitely or countably many cases. For instance, one could condition on 
the outcome $Y \in \{1, 2, 3, 4, 5, 6\}$ of a six-sided die, thus conditioning
the underlying sample space into six separate subspaces. If the die is 
fair, then the unconditional statistics of a random variable or event 
would be an unweighted average of the conditional statistics of the 
six conditioned subspaces; if the die is weighted, one would take a 
weighted average instead. 

Example 1.1.20. Let $X_1, X_2$ be iid signed Bernoulli random variables,
and let $Y := X_1 + X_2$, thus Y is a discrete random variable taking
values in $-2, 0, +2$ (with probability 1/4, 1/2, 1/4 respectively).
Then $X_1$ remains a signed Bernoulli random variable when conditioned
to Y = 0, but becomes the deterministic variable +1 when
conditioned to Y = +2, and similarly becomes the deterministic variable
-1 when conditioned to Y = -2. As a consequence, the conditional
expectation $E(X_1|Y)$ is equal to 0 when Y = 0, +1 when
Y = +2, and -1 when Y = -2; thus $E(X_1|Y) = Y/2$. Similarly
$E(X_2|Y) = Y/2$; summing and using the linearity of conditional
expectation we obtain the obvious identity E(Y|Y) = Y.
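
The conditional expectations in Example 1.1.20 can be checked by brute-force enumeration of the four equally likely sign patterns. The following Python sketch is only an illustration; the names `outcomes`, `cond_exp_x1`, `cond` and `total` are introduced here and do not appear in the text. It also confirms the identity (1.40) in this special case.

```python
from fractions import Fraction
from itertools import product

# Sample space: the four equally likely sign patterns (x1, x2).
outcomes = list(product([-1, +1], repeat=2))
prob = Fraction(1, 4)

def cond_exp_x1(y):
    """E(X_1 | Y = y), where Y = X_1 + X_2, by direct enumeration."""
    matching = [(x1, x2) for (x1, x2) in outcomes if x1 + x2 == y]
    p_y = prob * len(matching)
    return sum(prob * x1 for (x1, _) in matching) / p_y

# Conditional expectation on each value of Y with positive probability.
cond = {y: cond_exp_x1(y) for y in (-2, 0, 2)}

# Identity (1.40): E(X_1) = sum over y of E(X_1 | Y = y) P(Y = y).
p_Y = {y: prob * sum(1 for o in outcomes if sum(o) == y) for y in (-2, 0, 2)}
total = sum(cond[y] * p_Y[y] for y in cond)

for y in sorted(cond):
    print(f"E(X_1 | Y = {y:+d}) = {cond[y]}   (Y/2 = {Fraction(y, 2)})")
print("E(E(X_1 | Y)) =", total)
```

Exact rational arithmetic is used so that the comparison with Y/2 involves no rounding.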

If X, Y are independent, then $(X|Y = y) \equiv X$ for all y (with the
convention that those y for which P(Y = y) = 0 are ignored), which
implies in particular (for absolutely integrable X) that

E(X|Y) = E(X)

(so in this case the conditional expectation is a deterministic quan- 
tity). 

Example 1.1.21. Let X, Y be bounded scalar random variables (not 
necessarily independent), with Y discrete. Then we have 



E(XY) = E(E(XY|Y)) = E(Y E(X|Y))

where the latter equality holds since Y clearly becomes deterministic
after conditioning on Y. 

We will also need to condition with respect to continuous random 
variables (this is the probabilistic analogue of dividing into a poten- 
tially uncountable number of cases). To do this formally, we need 
to proceed a little differently from the discrete case, introducing the 
notion of a disintegration of the underlying sample space. 

Definition 1.1.22 (Disintegration). Let Y be a random variable with
range R. A disintegration $(R', (\mu_y)_{y \in R'})$ of the underlying sample
space $\Omega$ with respect to Y is a subset R' of R of full measure in $\mu_Y$
(thus $Y \in R'$ almost surely), together with an assignment of a probability
measure $P(|Y = y)$ on the subspace $\Omega_y := \{\omega \in \Omega : Y(\omega) = y\}$ of
$\Omega$ for each $y \in R'$, which is measurable in the sense that the map
$y \mapsto P(F|Y = y)$ is measurable for every event F, and such that

P(F) = E P(F|Y)

for all such events, where P(F|Y) is the (almost surely defined) random
variable defined to equal P(F|Y = y) whenever Y = y.

Given such a disintegration, we can then condition to the event
Y = y for any $y \in R'$ by replacing $\Omega$ with the subspace $\Omega_y$ (with the
induced $\sigma$-algebra), but replacing the underlying probability measure
P with $P(|Y = y)$. We can thus condition (unconditional) events
F and random variables X to this event to create conditioned events
(F|Y = y) and random variables (X|Y = y) on the conditioned space,
giving rise to conditional probabilities P(F|Y = y) (which is consistent
with the existing notation for this expression) and conditional
expectations E(X|Y = y) (assuming absolute integrability in this
conditioned space). We then set E(X|Y) to be the (almost surely defined)
random variable defined to equal E(X|Y = y) whenever Y = y.

Example 1.1.23 (Discrete case). If Y is a discrete random variable, 
one can set R' to be the essential range of Y, which in the discrete case 
is the set of all $y \in R$ for which P(Y = y) > 0. For each $y \in R'$, we
define P(|Y = y) to be the conditional probability measure relative 
to the event Y = y, as defined in Definition 1.1.17. It is easy to 
verify that this is indeed a disintegration; thus the continuous notion 
of conditional probability generalises the discrete one. 

Example 1.1.24 (Independent case). Starting with an initial sample
space $\Omega$, and a probability measure $\mu$ on a measurable space R, one
can adjoin a random variable Y taking values in R with distribution
$\mu$ that is independent of all previously existing random variables, by
extending $\Omega$ to $\Omega \times R$ as in Lemma 1.1.7. One can then disintegrate
Y by taking R' := R and letting $\mu_y$ be the probability measure on
$\Omega_y = \Omega \times \{y\}$ induced by the obvious isomorphism between $\Omega \times \{y\}$
and $\Omega$; this is easily seen to be a disintegration. Note that if X is any
random variable from the original space $\Omega$, then (X|Y = y) has the
same distribution as X for any $y \in R$.

Example 1.1.25. Let $\Omega = [0,1]^2$ with Lebesgue measure, and let
$(X_1, X_2)$ be the coordinate random variables of $\Omega$, thus $X_1, X_2$ are iid
with the uniform distribution on [0,1]. Let Y be the random variable
$Y := X_1 + X_2$ with range $R = \mathbb{R}$. Then one can disintegrate Y by
taking R' = [0, 2] and letting $\mu_y$ be normalised Lebesgue measure on
the diagonal line segment $\{(x_1, x_2) \in [0,1]^2 : x_1 + x_2 = y\}$.
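
As a sanity check on Example 1.1.25, one can test the defining identity P(F) = E P(F|Y) numerically for the event F = {X_1 <= t}. The Python sketch below is an illustration under stated assumptions: it uses the standard fact that Y = X_1 + X_2 has the triangular density on [0, 2], parametrises each diagonal segment by its x_1 coordinate, and integrates with a midpoint rule. The function names are hypothetical and not from the text.

```python
def cond_prob_F(y, t=0.5):
    """P(X_1 <= t | Y = y) under normalised Lebesgue measure on the segment
    {x_1 + x_2 = y} inside [0,1]^2, parametrised by x_1."""
    lo, hi = max(0.0, y - 1.0), min(1.0, y)   # range of x_1 on the segment
    if hi <= lo:
        return 0.0  # degenerate segment (y = 0 or y = 2), a null set for Y
    return max(0.0, min(hi, t) - lo) / (hi - lo)

def density_Y(y):
    """Density of Y = X_1 + X_2: the triangular density on [0, 2]."""
    return y if y <= 1.0 else 2.0 - y

def total_prob(t=0.5, steps=20000):
    """Midpoint-rule approximation of E P(F|Y), i.e. the integral of
    P(F | Y = y) against the distribution of Y."""
    h = 2.0 / steps
    return sum(cond_prob_F((i + 0.5) * h, t) * density_Y((i + 0.5) * h) * h
               for i in range(steps))

for t in (0.3, 0.5):
    print(f"t = {t}: E P(F|Y) is approximately {total_prob(t):.6f}, P(F) = {t}")
```

Since X_1 is uniform on [0,1], the unconditional probability P(F) is simply t, which is what the averaged conditional probabilities should reproduce.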

Exercise 1.1.23 (Almost uniqueness of disintegrations). Let $(R', (\mu_y)_{y \in R'})$,
$(\tilde R', (\tilde \mu_y)_{y \in \tilde R'})$ be two disintegrations of the same random variable Y.
Show that for any event F, one has $P(F|Y = y) = \tilde P(F|Y = y)$ for
$\mu_Y$-almost every $y \in R$, where the conditional probabilities $P(|Y = y)$
and $\tilde P(|Y = y)$ are defined using the disintegrations $(R', (\mu_y)_{y \in R'})$,
$(\tilde R', (\tilde \mu_y)_{y \in \tilde R'})$ respectively. (Hint: argue by contradiction, and
consider the set of y for which $P(F|Y = y)$ exceeds $\tilde P(F|Y = y)$ (or vice
versa) by some fixed $\varepsilon > 0$.)

Similarly, for a scalar random variable X, show that for $\mu_Y$-almost
every $y \in R$, (X|Y = y) is absolutely integrable with
respect to the first disintegration if and only if it is absolutely
integrable with respect to the second disintegration, and one has
$E(X|Y = y) = \tilde E(X|Y = y)$ in such cases.

Remark 1.1.26. Under some mild topological assumptions on the 
underlying sample space (and on the measurable space R), one can 
always find at least one disintegration for every random variable Y, 
by using tools such as the Radon-Nikodym theorem; see [Ta2009, 
Theorem 2.9.21]. In practice, we will not invoke these general re- 
sults here (as it is not natural for us to place topological conditions 
on the sample space), and instead construct disintegrations by hand
in specific cases, for instance by using the construction in Example 
1.1.24. 

Remark 1.1.27. Strictly speaking, disintegration is not a probabilis- 
tic concept; there is no canonical way to extend a disintegration when 
extending the sample space. However, due to the (almost) uniqueness 
and existence results alluded to earlier, this will not be a difficulty 
in practice. Still, we will try to use conditioning on continuous vari- 
ables sparingly, in particular containing their use inside the proofs of 
various lemmas, rather than in their statements, due to their slight 
incompatibility with the "probabilistic way of thinking" . 

Exercise 1.1.24 (Fubini-Tonelli theorem). Let $(R', (\mu_y)_{y \in R'})$ be a
disintegration of a random variable Y taking values in a measurable
space R, and let X be a non-negative (resp. absolutely integrable)
scalar random variable. Show that for $\mu_Y$-almost all $y \in R$,
(X|Y = y) is a non-negative (resp. absolutely integrable) random variable,
and one has the identity⁵

(1.42) E(E(X|Y)) = E(X),

where E(X|Y) is the (almost surely defined) random variable that
equals E(X|Y = y) whenever Y = y for some $y \in R'$. More generally, show that

(1.43) E(E(X|Y) f(Y)) = E(X f(Y)),

whenever $f : R \to \mathbb{R}$ is a non-negative (resp. bounded) measurable
function. (One can essentially take (1.43), together with the fact
that E(X|Y) is determined by Y, as a definition of the conditional
expectation E(X|Y), but we will not adopt this approach here.)

⁵ Note that one first needs to show that E(X|Y) is measurable before one can take
the expectation.

A typical use of conditioning is to deduce a probabilistic statement
from a deterministic one. For instance, suppose one has a
random variable X, and a parameter y in some range R, and an
event E(X, y) that depends on both X and y. Suppose we know that
$P(E(X, y)) \le \varepsilon$ for every $y \in R$. Then, we can conclude that whenever
Y is a random variable in R independent of X, we also have
$P(E(X, Y)) \le \varepsilon$, regardless of what the actual distribution of Y is. Indeed,
if we condition Y to be a fixed value y (using the construction
in Example 1.1.24, extending the underlying sample space if necessary),
we see that $P(E(X, Y)|Y = y) \le \varepsilon$ for each y; and then one
can integrate out the conditioning using (1.42) to obtain the claim.

The act of conditioning a random variable to be fixed is occasion- 
ally also called freezing. 
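
The freezing argument can be illustrated with a toy discrete example. In the Python sketch below (an illustration only; the die, the event and all names are hypothetical choices made here), X is a fair six-sided die and E(X, y) is the event {X = y}, so that P(E(X, y)) <= 1/6 for every frozen y. Averaging over an independent Y then gives the same bound for P(E(X, Y)), whatever the distribution of Y.

```python
from fractions import Fraction

# X is a fair six-sided die; E(X, y) is the event {X == y}.
p_X = {x: Fraction(1, 6) for x in range(1, 7)}

def prob_E_frozen(y):
    """P(E(X, y)) for a frozen (deterministic) parameter y."""
    return sum(p for x, p in p_X.items() if x == y)

def prob_E_random(p_Y):
    """P(E(X, Y)) for Y independent of X with distribution p_Y,
    obtained by conditioning on Y = y and then averaging over y."""
    return sum(p_Y[y] * prob_E_frozen(y) for y in p_Y)

eps = Fraction(1, 6)
# Two hypothetical distributions for Y on {1, ..., 6}.
skewed = {1: Fraction(9, 10), 2: Fraction(1, 10)}
uniform = {y: Fraction(1, 6) for y in range(1, 7)}

print("frozen bounds hold:", all(prob_E_frozen(y) <= eps for y in range(1, 7)))
print("P(E(X, Y)), skewed Y: ", prob_E_random(skewed))
print("P(E(X, Y)), uniform Y:", prob_E_random(uniform))
```

Independence is what allows the conditional probability given Y = y to be computed from the unconditional distribution of X.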

1.1.5. Convergence. In a first course in undergraduate real analysis,
we learn what it means for a sequence $x_n$ of scalars to converge
to a limit x; for every $\varepsilon > 0$, we have $|x_n - x| \le \varepsilon$ for all sufficiently
large n. Later on, this notion of convergence is generalised to metric
space convergence, and generalised further to topological space convergence;
in these generalisations, the sequence $x_n$ can lie in some
other space than the space of scalars (though one usually insists that
this space is independent of n).

Now suppose that we have a sequence $X_n$ of random variables,
all taking values in some space R; we will primarily be interested
in the scalar case when R is equal to $\mathbb{R}$ or $\mathbb{C}$, but will also need to
consider fancier random variables, such as point processes or empirical
spectral distributions. In what sense can we say that $X_n$ "converges"
to a random variable X, also taking values in R?

It turns out that there are several different notions of convergence 
which are of interest. For us, the four most important (in decreasing 
order of strength) will be almost sure convergence, convergence in 
probability, convergence in distribution, and tightness of distribution. 

Definition 1.1.28 (Modes of convergence). Let R = (R, d) be a
$\sigma$-compact⁶ metric space (with the Borel $\sigma$-algebra), and let $X_n$ be a
sequence of random variables taking values in R. Let X be another
random variable taking values in R.

⁶ A metric space is $\sigma$-compact if it is the countable union of compact sets.

(i) $X_n$ converges almost surely to X if, for almost every $\omega \in \Omega$,
$X_n(\omega)$ converges to $X(\omega)$, or equivalently

$P(\limsup_{n \to \infty} d(X_n, X) \le \varepsilon) = 1$

for every $\varepsilon > 0$.

(ii) $X_n$ converges in probability to X if, for every $\varepsilon > 0$, one has

$\liminf_{n \to \infty} P(d(X_n, X) \le \varepsilon) = 1,$

or equivalently if $d(X_n, X) \le \varepsilon$ holds asymptotically almost
surely for every $\varepsilon > 0$.

(iii) $X_n$ converges in distribution to X if, for every bounded continuous
function $F : R \to \mathbb{R}$, one has

$\lim_{n \to \infty} E F(X_n) = E F(X).$

(iv) $X_n$ has a tight sequence of distributions if, for every $\varepsilon > 0$,
there exists a compact subset K of R such that
$P(X_n \in K) \ge 1 - \varepsilon$ for all sufficiently large n.

Remark 1.1.29. One can relax the requirement that R be a
$\sigma$-compact metric space in the definitions, but then some of the nice
equivalences and other properties of these modes of convergence begin
to break down. In our applications, though, we will only need to
consider the $\sigma$-compact metric space case. Note that all of these
notions are probabilistic (i.e. they are preserved under extensions of 
the sample space). 

Exercise 1.1.25 (Implications and equivalences). Let $X_n, X$ be random
variables taking values in a $\sigma$-compact metric space R.

(i) Show that if $X_n$ converges almost surely to X, then $X_n$
converges in probability to X. (Hint: use Fatou's lemma.)

(ii) Show that if $X_n$ converges in distribution to X, then $X_n$
has a tight sequence of distributions.

(iii) Show that if $X_n$ converges in probability to X, then $X_n$
converges in distribution to X. (Hint: first show tightness,
then use the fact that on compact sets, continuous functions
are uniformly continuous.)

(iv) Show that $X_n$ converges in distribution to X if and only if
$\mu_{X_n}$ converges to $\mu_X$ in the vague topology (i.e. $\int f\, d\mu_{X_n} \to \int f\, d\mu_X$
for all continuous functions $f : R \to \mathbb{R}$ of compact
support).

(v) Conversely, if $X_n$ has a tight sequence of distributions, and
$\mu_{X_n}$ is convergent in the vague topology, show that $X_n$ is
convergent in distribution to another random variable (possibly
after extending the sample space). What happens if
the tightness hypothesis is dropped?

(vi) If X is deterministic, show that $X_n$ converges in probability
to X if and only if $X_n$ converges in distribution to X.

(vii) If $X_n$ has a tight sequence of distributions, show that there
is a subsequence of the $X_n$ which converges in distribution.
(This is known as Prokhorov's theorem.)

(viii) If $X_n$ converges in probability to X, show that there is a
subsequence of the $X_n$ which converges almost surely to X.

(ix) $X_n$ converges in distribution to X if and only if
$\liminf_{n \to \infty} P(X_n \in U) \ge P(X \in U)$ for every open subset U of R,
or equivalently if $\limsup_{n \to \infty} P(X_n \in K) \le P(X \in K)$ for every
closed subset K of R.

Remark 1.1.30. The relationship between almost sure convergence
and convergence in probability may be clarified by the following
observation. If $E_n$ is a sequence of events, then the indicators $I(E_n)$
converge in probability to zero iff $P(E_n) \to 0$ as $n \to \infty$, but
converge almost surely to zero iff $P(\bigcup_{n \ge N} E_n) \to 0$ as $N \to \infty$.

Example 1.1.31. Let Y be a random variable drawn uniformly from
[0,1]. For each $n \ge 1$, let $E_n$ be the event that the decimal expansion
of Y begins with the decimal expansion of n, e.g. every
real number in [0.25, 0.26) lies in $E_{25}$. (Let us ignore the annoying
0.999... = 1.000... ambiguity in the decimal expansion here, as it
will almost surely not be an issue.) Then the indicators $I(E_n)$ converge
in probability and in distribution to zero, but do not converge
almost surely.

If $y_n$ is the n-th digit of Y, then the $y_n$ converge in distribution
(to the uniform distribution on {0, 1, ..., 9}), but do not converge in
probability or almost surely. Thus we see that the latter two notions
are sensitive not only to the distribution of the random variables, but
how they are positioned in the sample space.
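
The first half of Example 1.1.31 can be made concrete with a short Python sketch (an illustration only; the sample point and the function names are hypothetical). The probability of E_n is 10 to the power minus the number of digits of n, which tends to zero, yet at a fixed sample point whose expansion starts with a nonzero digit the indicator equals 1 for every prefix of that expansion, and hence for infinitely many n.

```python
from fractions import Fraction

def prob_E(n):
    """P(E_n): a uniform Y in [0,1] starts with the digits of n (n >= 1)."""
    return Fraction(1, 10 ** len(str(n)))

def indicator_E(n, digits):
    """I(E_n) at the sample point Y = 0.d1 d2 d3 ..., given as a digit string."""
    return 1 if digits.startswith(str(n)) else 0

# A hypothetical sample point, Y = 0.25317...
digits = "25317"
hits = [n for n in range(1, 100000) if indicator_E(n, digits)]

print("n with I(E_n) = 1:", hits)
print("P(E_n) for those n:", [str(prob_E(n)) for n in hits])
```

Each further digit of Y contributes one more value of n at which the indicator returns to 1, so the sequence of indicators cannot converge to zero at this sample point.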

The limit of a sequence converging almost surely or in probability
is clearly unique up to almost sure equivalence, whereas the limit
of a sequence converging in distribution is only unique up to equivalence
in distribution. Indeed, convergence in distribution is really a
statement about the distributions $\mu_{X_n}, \mu_X$ rather than of the random
variables $X_n, X$ themselves. In particular, for convergence in distribution
one does not care about how correlated or dependent the $X_n$
are with respect to each other, or with X; indeed, they could even
live on different sample spaces $\Omega_n, \Omega$ and we would still have a well-defined
notion of convergence in distribution, even though the other
two notions cease to make sense (except when X is deterministic,
in which case we can recover convergence in probability by Exercise
1.1.25(vi)).

Exercise 1.1.26 (Borel-Cantelli lemma). Suppose that $X_n, X$ are
random variables such that $\sum_n P(d(X_n, X) \ge \varepsilon) < \infty$ for every
$\varepsilon > 0$. Show that $X_n$ converges almost surely to X.

Exercise 1.1.27 (Convergence and moments). Let $X_n$ be a sequence
of scalar random variables, and let X be another scalar random variable.
Let $k, \varepsilon > 0$.

(i) If $\sup_n E|X_n|^k < \infty$, show that $X_n$ has a tight sequence of
distributions.

(ii) If $\sup_n E|X_n|^k < \infty$ and $X_n$ converges in distribution to X,
show that $E|X|^k \le \liminf_{n \to \infty} E|X_n|^k$.

(iii) If $\sup_n E|X_n|^{k+\varepsilon} < \infty$ and $X_n$ converges in distribution to
X, show that $E|X|^k = \lim_{n \to \infty} E|X_n|^k$.

(iv) Give a counterexample to show that (iii) fails when $\varepsilon = 0$,
even if we upgrade convergence in distribution to almost
sure convergence.

(v) If the $X_n$ are uniformly bounded and real-valued, and
$E X^k = \lim_{n \to \infty} E X_n^k$ for every k = 0, 1, 2, ..., then $X_n$ converges in
distribution to X. (Hint: use the Weierstrass approximation
theorem. Alternatively, use the analytic nature of the moment
generating function $E e^{tX}$ and analytic continuation.)

(vi) If the $X_n$ are uniformly bounded and complex-valued, and
$E X^k \overline{X}^l = \lim_{n \to \infty} E X_n^k \overline{X_n}^l$ for every k, l = 0, 1, 2, ..., then
$X_n$ converges in distribution to X. Give a counterexample
to show that the claim fails if one only considers the cases
when l = 0.

There are other interesting modes of convergence on random vari- 
ables and on distributions, such as convergence in total variation 
norm, in the Lévy-Prokhorov metric, or in Wasserstein metric, but
we will not need these concepts in this text. 

1.2. Stirling's formula 

In this section we derive Stirling's formula, which is a useful approximation
for n! when n is large. This formula (and related formulae
for binomial coefficients $\binom{n}{m}$) will be useful for estimating a number
of combinatorial quantities in this text, and also in allowing one to
analyse discrete random walks accurately.

From Taylor expansion we have $x^n/n! \le e^x$ for any $x \ge 0$. Specialising
this to x = n we obtain a crude lower bound

(1.44) $n! \ge n^n e^{-n}.$

In the other direction, we trivially have

(1.45) $n! \le n^n$

so we know already that n! is within an exponential factor of $n^n$.

⁷ One can also obtain a cruder version of this fact that avoids Taylor expansion, by
observing the trivial lower bound $n! \ge (n/2)^{\lfloor n/2 \rfloor}$ coming from considering the second
half of the product $n! = 1 \cdot \ldots \cdot n$.

One can do better by starting with the identity

$\log n! = \sum_{m=1}^n \log m$

and viewing the right-hand side as a Riemann integral approximation
to $\int_1^n \log x\, dx$. Indeed a simple area comparison (cf. the integral test)
yields the inequalities

$\int_1^n \log x\, dx \le \sum_{m=1}^n \log m \le \log n + \int_1^n \log x\, dx$
which leads to the inequalities

(1.46) $e n^n e^{-n} \le n! \le e n \times n^n e^{-n}$

so the lower bound in (1.44) was only off⁸ by a factor of n or so.

One can do better by using the trapezoid rule as follows. On any
interval [m, m + 1], log x has a second derivative of $O(1/m^2)$, which
by Taylor expansion leads to the approximation

$\int_m^{m+1} \log x\, dx = \frac{1}{2} \log m + \frac{1}{2} \log(m+1) + \varepsilon_m$

for some error $\varepsilon_m = O(1/m^2)$.

The error is absolutely convergent; by the integral test, we have
$\sum_{m=1}^n \varepsilon_m = C + O(1/n)$ for some absolute constant $C := \sum_{m=1}^\infty \varepsilon_m$.
Performing this sum, we conclude that

$\int_1^n \log x\, dx = \sum_{m=1}^{n-1} \log m + \frac{1}{2} \log n + C + O(1/n)$

which after some rearranging leads to the asymptotic

(1.47) $n! = (1 + O(1/n)) e^{1-C} \sqrt{n} n^n e^{-n}$

so we see that n! actually lies roughly at the geometric mean of the
two bounds in (1.46).
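
The constant C in (1.47) can be approximated directly from its definition as the sum of the trapezoid errors. The Python sketch below is an illustration only (the function names and the cutoff of 100000 terms are choices made here, not part of the text); it uses the antiderivative x log x - x of log x, and compares the resulting prefactor with the classical value, the square root of 2 pi, that appears in Stirling's formula.

```python
import math

def log_integral(a, b):
    """Exact value of the integral of log x over [a, b],
    using the antiderivative x log x - x."""
    return (b * math.log(b) - b) - (a * math.log(a) - a)

def trapezoid_error(m):
    """The integral of log x over [m, m+1] minus its trapezoid approximation."""
    return log_integral(m, m + 1) - 0.5 * (math.log(m) + math.log(m + 1))

def approx_C(terms=100000):
    """Partial sum of the trapezoid errors; the omitted tail is O(1/terms)."""
    return math.fsum(trapezoid_error(m) for m in range(1, terms + 1))

C = approx_C()
prefactor = math.exp(1 - C)      # the constant e^{1-C} appearing in (1.47)

print(f"C is approximately {C:.6f}")
print(f"e^(1-C) is approximately {prefactor:.6f}")
print(f"sqrt(2 pi) = {math.sqrt(2 * math.pi):.6f}")
```

This is the numerical route the argument allows; it gives good evidence for the value of the prefactor but does not by itself prove it.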

This argument does not easily reveal what the constant C actually
is (though it can in principle be computed numerically to any specified
level of accuracy by this method). To find this out, we take a different
tack, interpreting the factorial via the Gamma function $\Gamma : \mathbb{R} \to \mathbb{R}$
as follows. Repeated integration by parts reveals the identity⁹

(1.48) $n! = \int_0^\infty t^n e^{-t}\, dt.$

⁸ This illustrates a general principle, namely that one can often get a non-terrible
bound for a series (in this case, the Taylor series for $e^n$) by using the largest term in
that series (which is $n^n/n!$).

⁹ The right-hand side of (1.48), by definition, is $\Gamma(n + 1)$.

So to estimate n!, it suffices to estimate the integral in (1.48). Elementary
calculus reveals that the integrand $t^n e^{-t}$ achieves its maximum
at t = n, so it is natural to make the substitution t = n + s, obtaining

$n! = \int_{-n}^\infty (n+s)^n e^{-n-s}\, ds,$
which we can simplify a little bit as

$n! = n^n e^{-n} \int_{-n}^\infty (1 + \frac{s}{n})^n e^{-s}\, ds,$

pulling out the now-familiar factors of $n^n e^{-n}$. We combine the integrand
into a single exponential,

$n! = n^n e^{-n} \int_{-n}^\infty \exp(n \log(1 + \frac{s}{n}) - s)\, ds.$

From Taylor expansion we see that

$n \log(1 + \frac{s}{n}) = s - \frac{s^2}{2n} + \dots$

so we heuristically have

$\exp(n \log(1 + \frac{s}{n}) - s) \approx \exp(-s^2/2n).$

To achieve this approximation rigorously we first scale s by $\sqrt{n}$ to
remove the n in the denominator. Making the substitution $s = \sqrt{n} x$,
we obtain

$n! = \sqrt{n} n^n e^{-n} \int_{-\sqrt{n}}^\infty \exp(n \log(1 + \frac{x}{\sqrt{n}}) - \sqrt{n} x)\, dx,$

thus extracting the factor of $\sqrt{n}$ that we know from (1.47) has to be
there.

Now, Taylor expansion tells us that for fixed x, we have the pointwise
convergence

(1.49) $\exp(n \log(1 + \frac{x}{\sqrt{n}}) - \sqrt{n} x) \to \exp(-x^2/2)$

as $n \to \infty$. To be more precise, as the function $n \log(1 + \frac{x}{\sqrt{n}})$ equals
0 with derivative $\sqrt{n}$ at the origin, and has second derivative $\frac{-1}{(1 + x/\sqrt{n})^2}$,
we see from two applications of the fundamental theorem of calculus
that

$n \log(1 + \frac{x}{\sqrt{n}}) - \sqrt{n} x = - \int_0^x \frac{(x - y)\, dy}{(1 + y/\sqrt{n})^2}.$

This gives a uniform upper bound

$n \log(1 + \frac{x}{\sqrt{n}}) - \sqrt{n} x \le -c x^2$

for some c > 0 when $|x| \le \sqrt{n}$, and

$n \log(1 + \frac{x}{\sqrt{n}}) - \sqrt{n} x \le -c x \sqrt{n}$

for $|x| > \sqrt{n}$. This is enough to keep the integrands
$\exp(n \log(1 + \frac{x}{\sqrt{n}}) - \sqrt{n} x)$ dominated by an absolutely integrable function. By
(1.49) and the Lebesgue dominated convergence theorem, we thus
have

$\int_{-\sqrt{n}}^\infty \exp(n \log(1 + \frac{x}{\sqrt{n}}) - \sqrt{n} x)\, dx \to \int_{-\infty}^\infty \exp(-x^2/2)\, dx.$

A classical computation (based for instance on computing
$\int_{-\infty}^\infty \int_{-\infty}^\infty \exp(-(x^2 + y^2)/2)\, dx dy$ in both Cartesian and polar coordinates) shows that

$\int_{-\infty}^\infty \exp(-x^2/2)\, dx = \sqrt{2\pi}$

and so we conclude Stirling's formula

(1.50) $n! = (1 + o(1)) \sqrt{2\pi n} n^n e^{-n}.$

Remark 1.2.1. The dominated convergence theorem does not immediately
give any effective rate on the decay o(1) (though such a rate
can eventually be extracted by a quantitative version of the above
argument). But one can combine (1.50) with (1.47) to show that the
error rate is of the form O(1/n). By using fancier versions of the
trapezoid rule (e.g. Simpson's rule) one can obtain an asymptotic
expansion of the error term in 1/n; see [KeVa2007].

Remark 1.2.2. The derivation of (1.50) demonstrates some general
principles concerning the estimation of exponential integrals $\int e^{\phi(x)}\, dx$
when $\phi$ is large. Firstly, the integral is dominated by the local maxima
of $\phi$. Then, near these maxima, $e^{\phi(x)}$ usually behaves like a rescaled
Gaussian, as can be seen by Taylor expansion (though more complicated
behaviour emerges if the second derivative of $\phi$ degenerates).
So one can often understand the asymptotics of such integrals by a
change of variables designed to reveal the Gaussian behaviour. This
technique is known as Laplace's method. A similar set of principles
also holds for oscillatory exponential integrals $\int e^{i\phi(x)}\, dx$; these principles
are collectively referred to as the method of stationary phase.
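
Both the rescaled integral and Stirling's formula (1.50) are easy to probe numerically. The Python sketch below is an illustration only: the function names, the truncation point `upper` and the midpoint rule are choices made here, and the tail of the integral beyond `upper` is simply assumed to be negligible. It compares the rescaled integral with the exact quantity n! divided by the square root of n times n^n e^{-n}, and reports the ratio of n! to the Stirling approximation.

```python
import math

def stirling_ratio(n):
    """n! divided by sqrt(2 pi n) n^n e^{-n}, computed via logarithms."""
    log_stirling = 0.5 * math.log(2 * math.pi * n) + n * math.log(n) - n
    return math.exp(math.lgamma(n + 1) - log_stirling)

def rescaled_integral(n, upper=40.0, steps=200000):
    """Midpoint-rule approximation of the integral over [-sqrt(n), upper] of
    exp(n log(1 + x/sqrt(n)) - sqrt(n) x); the tail beyond `upper` is
    assumed to be negligible."""
    root = math.sqrt(n)
    h = (upper + root) / steps
    total = 0.0
    for i in range(steps):
        x = -root + (i + 0.5) * h
        total += math.exp(n * math.log1p(x / root) - root * x)
    return total * h

n = 100
print(f"rescaled integral (n = {n}):        {rescaled_integral(n):.6f}")
print(f"sqrt(2 pi) * n!/Stirling (n = {n}): "
      f"{math.sqrt(2 * math.pi) * stirling_ratio(n):.6f}")
print(f"sqrt(2 pi):                         {math.sqrt(2 * math.pi):.6f}")
for m in (10, 100, 1000):
    print(f"n = {m}: n!/Stirling - 1 = {stirling_ratio(m) - 1:.3e}")
```

The last loop is consistent with the O(1/n) error rate mentioned in Remark 1.2.1: the discrepancy shrinks by roughly a factor of ten each time n grows by a factor of ten.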

One can use Stirling's formula to estimate binomial coefficients.
Here is a crude bound:

Exercise 1.2.1 (Entropy formula). Let n be large, let $0 < \gamma < 1$ be
fixed, and let $1 \le m \le n$ be an integer of the form $m = (\gamma + o(1)) n$.
Show that $\binom{n}{m} = \exp((h(\gamma) + o(1)) n)$, where $h(\gamma)$ is the entropy
function

$h(\gamma) := \gamma \log \frac{1}{\gamma} + (1 - \gamma) \log \frac{1}{1 - \gamma}.$

For m near n/2, one also has the following more precise bound: 

Exercise 1.2.2 (Refined entropy formula). Let n be large, and let
$1 \le m \le n$ be an integer of the form m = n/2 + k for some $k = o(n^{2/3})$.
Show that

(1.51) $\binom{n}{m} = (\sqrt{\frac{2}{\pi}} + o(1)) \frac{2^n}{\sqrt{n}} \exp(-2k^2/n).$

Note the Gaussian-type behaviour in k. This can be viewed as
an illustration of the central limit theorem (see Section 2.2) when
summing iid Bernoulli variables $X_1, \dots, X_n \in \{0, 1\}$, where each $X_i$
has a 1/2 probability of being either 0 or 1. Indeed, from (1.51) we
see that

$P(X_1 + \dots + X_n = n/2 + k) = (\sqrt{\frac{2}{\pi}} + o(1)) \frac{1}{\sqrt{n}} \exp(-2k^2/n)$

when $k = o(n^{2/3})$, which suggests that $X_1 + \dots + X_n$ is distributed
roughly like the Gaussian N(n/2, n/4) with mean n/2 and variance
n/4.
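
The two binomial estimates can be compared against exact binomial coefficients. The following Python sketch is an illustration only; the function names and the particular values of n, m and k are hypothetical choices, and the o(1) terms in the exercises are simply dropped. It relies on `math.comb`, which is available from Python 3.8 onwards.

```python
import math

def entropy(gamma):
    """h(gamma) = gamma log(1/gamma) + (1 - gamma) log(1/(1 - gamma))."""
    return -gamma * math.log(gamma) - (1 - gamma) * math.log(1 - gamma)

def log_binomial(n, m):
    """Logarithm of the binomial coefficient, from the exact integer."""
    return math.log(math.comb(n, m))

def refined_ratio(n, k):
    """Binomial coefficient (n choose n/2 + k) divided by the right-hand
    side of (1.51) with the o(1) term dropped; n is assumed even."""
    log_rhs = (0.5 * math.log(2 / math.pi) + n * math.log(2)
               - 0.5 * math.log(n) - 2 * k * k / n)
    return math.exp(log_binomial(n, n // 2 + k) - log_rhs)

n = 10000
print(f"(1/n) log C(n, 0.3 n) = {log_binomial(n, 3000) / n:.5f}")
print(f"h(0.3)                = {entropy(0.3):.5f}")
for k in (0, 50, 200):
    print(f"k = {k:3d}: exact / approximation = {refined_ratio(n, k):.6f}")
```

Working with logarithms avoids overflow, since both the binomial coefficient and 2^n are far too large to be represented as floating-point numbers at this value of n.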

1.3. Eigenvalues and sums of Hermitian matrices 

Let A be a Hermitian $n \times n$ matrix. By the spectral theorem for
Hermitian matrices (which, for sake of completeness, we prove below),
one can diagonalise A using a sequence¹⁰

$\lambda_1(A) \ge \dots \ge \lambda_n(A)$

of n real eigenvalues, together with an orthonormal basis of eigenvectors
$u_1(A), \dots, u_n(A) \in \mathbb{C}^n$. The set $\{\lambda_1(A), \dots, \lambda_n(A)\}$ is known as
the spectrum of A.

¹⁰ The eigenvalues are uniquely determined by A, but the eigenvectors have a
little ambiguity to them, particularly if there are repeated eigenvalues; for instance,
one could multiply each eigenvector by a complex phase $e^{i\theta}$. In this text we are
arranging eigenvalues in descending order; of course, one can also arrange eigenvalues
in increasing order, which causes some slight notational changes in the results below.

A basic question in linear algebra asks the extent to which the
eigenvalues $\lambda_1(A), \dots, \lambda_n(A)$ and $\lambda_1(B), \dots, \lambda_n(B)$ of two Hermitian
matrices A, B constrain the eigenvalues $\lambda_1(A + B), \dots, \lambda_n(A + B)$ of
the sum. For instance, the linearity of trace

tr(A + B) = tr(A) + tr(B),

when expressed in terms of eigenvalues, gives the trace constraint

(1.52) $\lambda_1(A + B) + \dots + \lambda_n(A + B) = \lambda_1(A) + \dots + \lambda_n(A) + \lambda_1(B) + \dots + \lambda_n(B);$

the identity

(1.53) $\lambda_1(A) = \sup_{|v| = 1} v^* A v$

(together with the counterparts for B and A + B) gives the inequality

(1.54) $\lambda_1(A + B) \le \lambda_1(A) + \lambda_1(B);$

and so forth.

The complete answer to this problem is a fascinating one, requir- 
ing a strangely recursive description (once known as Horn's conjec- 
ture, which is now solved), and connected to a large number of other 
fields of mathematics, such as geometric invariant theory, intersec- 
tion theory, and the combinatorics of a certain gadget known as a 
"honeycomb" . See [KnTa2001] for a survey of this topic. 

In typical applications to random matrices, one of the matrices
(say, B) is "small" in some sense, so that A + B is a perturbation of A.
In this case, one does not need the full strength of the above theory,
and can instead rely on a simple aspect of it pointed out in [HeRo1995],
[To1994], which generates several of the eigenvalue inequalities
relating A, B, and A + B, of which (1.52) and (1.54) are examples¹¹.

¹¹ Actually, this method eventually generates all of the eigenvalue inequalities,
but this is a non-trivial fact to prove; see [KnTaWo2004].

These eigenvalue inequalities can mostly be deduced from a number
of minimax characterisations of eigenvalues (of which (1.53) is a typical
example), together with some basic facts about intersections of
subspaces. Examples include the Weyl inequalities

(1.55) $\lambda_{i+j-1}(A + B) \le \lambda_i(A) + \lambda_j(B),$

valid whenever $i, j \ge 1$ and $i + j - 1 \le n$, and the Ky Fan inequality

(1.56) $\lambda_1(A + B) + \dots + \lambda_k(A + B) \le \lambda_1(A) + \dots + \lambda_k(A) + \lambda_1(B) + \dots + \lambda_k(B).$

One consequence of these inequalities is that the spectrum of a Her- 
mitian matrix is stable with respect to small perturbations. 
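
For a quick experiment with the Weyl inequalities (1.55), the smallest non-trivial case n = 2 already suffices, since the eigenvalues of a real symmetric 2 x 2 matrix have a closed form. The Python sketch below is an illustration under that restriction; the representation of a matrix by its three entries (a, b, c), the function names and the random test matrices are all choices made here, and a numerical check of this kind is of course not a proof.

```python
import math
import random

def eigenvalues_2x2(a, b, c):
    """Eigenvalues (largest first) of the real symmetric matrix
    [[a, b], [b, c]], from the closed-form solution of the quadratic."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean + radius, mean - radius

def weyl_holds(A, B, tol=1e-12):
    """Check the Weyl inequalities (1.55) for n = 2, i.e. for all (i, j)
    with i, j >= 1 and i + j - 1 <= 2 (indices below are 0-based)."""
    lam_A = eigenvalues_2x2(*A)
    lam_B = eigenvalues_2x2(*B)
    lam_S = eigenvalues_2x2(*(x + y for x, y in zip(A, B)))
    return all(lam_S[i + j] <= lam_A[i] + lam_B[j] + tol
               for i in range(2) for j in range(2) if i + j <= 1)

rng = random.Random(0)  # fixed seed, purely for reproducibility
trials = [(tuple(rng.uniform(-1, 1) for _ in range(3)),
           tuple(rng.uniform(-1, 1) for _ in range(3))) for _ in range(1000)]
all_ok = all(weyl_holds(A, B) for A, B in trials)
print("Weyl inequalities hold in all trials:", all_ok)
```

The small tolerance `tol` only guards against floating-point rounding; for larger matrices one would need a general symmetric eigenvalue routine in place of the closed form.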

We will also establish some closely related inequalities concern- 
ing the relationships between the eigenvalues of a matrix, and the 
eigenvalues of its minors. 

Many of the inequalities here have analogues for the singular values
of non-Hermitian matrices (by exploiting the augmented matrix
(2.80)). However, the situation is markedly different when dealing
with eigenvalues of non-Hermitian matrices; here, the spectrum can
be far more unstable, if pseudospectrum is present. Because of this,
the theory of the eigenvalues of a random non-Hermitian matrix requires
an additional ingredient, namely upper bounds on the prevalence
of pseudospectrum, which after recentering the matrix is basically
equivalent to establishing lower bounds on least singular values.
See Section 2.8.1 for further discussion of this point.

We will work primarily here with Hermitian matrices, which can
be viewed as self-adjoint transformations on complex vector spaces
such as $\mathbb{C}^n$. One can of course specialise the discussion to real symmetric
matrices, in which case one can restrict these complex vector
spaces to their real counterparts $\mathbb{R}^n$. The specialisation of the complex
theory below to the real case is straightforward and is left to the
interested reader.

1.3.1. Proof of spectral theorem. To prove the spectral theorem, 
it is convenient to work more abstractly, in the context of self-adjoint 
operators on finite-dimensional Hilbert spaces: 

Theorem 1.3.1 (Spectral theorem). Let V be a finite-dimensional
complex Hilbert space of some dimension n, and let $T : V \to V$
be a self-adjoint operator. Then there exists an orthonormal basis
$v_1, \dots, v_n \in V$ of V and eigenvalues $\lambda_1, \dots, \lambda_n \in \mathbb{R}$ such that
$T v_i = \lambda_i v_i$ for all $1 \le i \le n$.

The spectral theorem as stated in the introduction then follows 
by specialising to the case $V = \mathbb{C}^n$ and ordering the eigenvalues.

Proof. We induct on the dimension n. The claim is vacuous for
n = 0, so suppose that $n \ge 1$ and that the claim has already been
proven for n - 1.

Let v be a unit vector in $\mathbb{C}^n$ (thus $v^* v = 1$) that maximises the
form $\mathrm{Re}\, v^* T v$; this maximum exists by compactness. By the method
of Lagrange multipliers, v is a critical point of $\mathrm{Re}\, v^* T v - \lambda v^* v$ for
some $\lambda \in \mathbb{R}$. Differentiating in an arbitrary direction $w \in \mathbb{C}^n$, we
conclude that

$\mathrm{Re}(v^* T w + w^* T v - \lambda v^* w - \lambda w^* v) = 0;$

this simplifies using self-adjointness to

$\mathrm{Re}(w^* (T v - \lambda v)) = 0.$

Since $w \in \mathbb{C}^n$ was arbitrary, we conclude that $T v = \lambda v$, thus v
is a unit eigenvector of T. By self-adjointness, this implies that the
orthogonal complement $v^\perp := \{w \in V : v^* w = 0\}$ of v is preserved by
T. Restricting T to this lower-dimensional subspace and applying the
induction hypothesis, we can find an orthonormal basis of eigenvectors
of T on $v^\perp$. Adjoining the new unit vector v to the orthonormal basis,
we obtain the claim. □

Suppose we have a self-adjoint transformation $A : \mathbf{C}^n \to \mathbf{C}^n$,
which of course can be identified with a Hermitian matrix. Using
the orthonormal eigenbasis provided by the spectral theorem, we can
perform an orthonormal change of variables to set that eigenbasis
to be the standard basis $e_1, \dots, e_n$, so that the matrix of $A$ becomes
diagonal. This is very useful when dealing with just a single matrix $A$
- for instance, it makes the task of computing functions of $A$, such as
$A^k$ or $\exp(tA)$, much easier. However, when one has several Hermitian
matrices in play (e.g. $A$, $B$, $A + B$), then it is usually not possible to
standardise all the eigenbases simultaneously (i.e. to simultaneously
diagonalise all the matrices), except when the matrices all commute.
Nevertheless one can still normalise one of the eigenbases to be the
standard basis, and this is still useful for several applications, as we
shall soon see.
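As a numerical illustration of the preceding paragraph, the following sketch diagonalises a randomly generated Hermitian matrix and uses the eigenbasis to compute $A^3$ and $\exp(tA)$. The matrix, the seed, the value $t = 0.5$ and the helper name `apply_function` are assumptions made purely for this example; note also that NumPy's `eigh` lists eigenvalues in increasing order, the reverse of the convention used in this text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (M + M.conj().T) / 2              # Hermitian by construction

# eigh returns real eigenvalues w (ascending) and a unitary U whose columns
# are orthonormal eigenvectors, so that A = U diag(w) U^*.
w, U = np.linalg.eigh(A)

def apply_function(f):
    """Return f(A), computed by applying f to the eigenvalues of A."""
    return (U * f(w)) @ U.conj().T    # U diag(f(w)) U^*

t = 0.5
A_cubed = apply_function(lambda x: x**3)               # A^3
exp_tA = apply_function(lambda x: np.exp(t * x))       # exp(tA)
exp_minus_tA = apply_function(lambda x: np.exp(-t * x))
```

In the eigenbasis, every function of $A$ reduces to a scalar function of the eigenvalues, which is why a single call to `eigh` suffices for all three matrices above.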

Exercise 1.3.1. Suppose that the eigenvalues $\lambda_1(A) > \dots > \lambda_n(A)$
of an $n \times n$ Hermitian matrix are distinct. Show that the associated
eigenbasis $u_1(A), \dots, u_n(A)$ is unique up to rotating each individual
eigenvector $u_j(A)$ by a complex phase $e^{i\theta_j}$. In particular, the spectral
projections $P_j(A) := u_j(A) u_j(A)^*$ are unique. What happens when
there is eigenvalue multiplicity?

1.3.2. Minimax formulae. The $i^{\text{th}}$ eigenvalue functional $A \mapsto \lambda_i(A)$
is not a linear functional (except in dimension one). It is not even a
convex functional (except when $i = 1$) or a concave functional (except
when $i = n$). However, it is the next best thing, namely it is a
minimax expression of linear functionals$^{12}$. More precisely, we have

Theorem 1.3.2 (Courant-Fischer min-max theorem). Let $A$ be an
$n \times n$ Hermitian matrix. Then we have

$$\lambda_i(A) = \sup_{\dim(V) = i} \ \inf_{v \in V : |v| = 1} v^* A v \tag{1.57}$$

and

$$\lambda_i(A) = \inf_{\dim(V) = n-i+1} \ \sup_{v \in V : |v| = 1} v^* A v \tag{1.58}$$

for all $1 \le i \le n$, where $V$ ranges over all subspaces of $\mathbf{C}^n$ with the
indicated dimension.

$^{12}$ Note that a convex functional is the same thing as a max of linear functionals,
while a concave functional is the same thing as a min of linear functionals.

Proof. It suffices to prove (1.57), as (1.58) follows by replacing $A$ by
$-A$ (noting that $\lambda_i(-A) = -\lambda_{n-i+1}(A)$).

We first verify the $i = 1$ case, i.e. (1.53). By the spectral theorem,
we can assume that $A$ has the standard eigenbasis $e_1, \dots, e_n$, in which
case we have

$$v^* A v = \sum_{i=1}^n \lambda_i |v_i|^2 \tag{1.59}$$

whenever $v = (v_1, \dots, v_n)$. The claim (1.53) is then easily verified.

To prove the general case, we may again assume $A$ has the standard
eigenbasis. By considering the space $V$ spanned by $e_1, \dots, e_i$,
we easily see the inequality

$$\lambda_i(A) \le \sup_{\dim(V) = i} \ \inf_{v \in V : |v| = 1} v^* A v$$

so we only need to prove the reverse inequality. In other words,
for every $i$-dimensional subspace $V$ of $\mathbf{C}^n$, we have to show that $V$
contains a unit vector $v$ such that

$$v^* A v \le \lambda_i(A).$$

Let $W$ be the space spanned by $e_i, \dots, e_n$. This space has codimension
$i - 1$, so it must have non-trivial intersection with $V$. If we let $v$ be
a unit vector in $V \cap W$, the claim then follows from (1.59). □
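The min-max formula (1.57) can be probed numerically. The sketch below (with an arbitrary random Hermitian matrix, seed and sizes chosen only for illustration) uses the fact that, if the columns of $Q$ form an orthonormal basis of $V$, then the infimum of $v^* A v$ over unit vectors $v \in V$ is the smallest eigenvalue of the compression $Q^* A Q$. It checks that the span of the top $i$ eigenvectors attains $\lambda_i(A)$, while randomly sampled $i$-dimensional subspaces never exceed it.

```python
import numpy as np

rng = np.random.default_rng(1)
n, i = 6, 3
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (M + M.conj().T) / 2

# Reorder to the text's convention lambda_1 >= ... >= lambda_n.
w, U = np.linalg.eigh(A)
lam, U = w[::-1], U[:, ::-1]

def min_rayleigh(Q):
    """inf of v^* A v over unit vectors v in the column span of Q (Q orthonormal)."""
    return np.linalg.eigvalsh(Q.conj().T @ A @ Q)[0]

# The span of e_1, ..., e_i (in the eigenbasis) attains lambda_i.
best = min_rayleigh(U[:, :i])

# Random i-dimensional subspaces: each inner infimum is at most lambda_i.
samples = []
for _ in range(200):
    G = rng.standard_normal((n, i)) + 1j * rng.standard_normal((n, i))
    Q, _ = np.linalg.qr(G)            # orthonormal basis of a random subspace
    samples.append(min_rayleigh(Q))
```

Random sampling only tests the inequality direction of (1.57); it is the explicit subspace spanned by the top eigenvectors that shows the supremum is attained.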

Remark 1.3.3. By homogeneity, one can replace the restriction $|v| = 1$
with $v \ne 0$ provided that one replaces the quadratic form $v^* A v$ with
the Rayleigh quotient $v^* A v / v^* v$.

A closely related formula is as follows. Given an $n \times n$ Hermitian
matrix $A$ and an $m$-dimensional subspace $V$ of $\mathbf{C}^n$, we define the
partial trace $\operatorname{tr}(A \downharpoonright_V)$ to be the expression

$$\operatorname{tr}(A \downharpoonright_V) := \sum_{i=1}^m v_i^* A v_i$$

where $v_1, \dots, v_m$ is any orthonormal basis of $V$. It is easy to see that
this expression is independent of the choice of orthonormal basis, and
so the partial trace is well-defined.

Proposition 1.3.4 (Extremal partial trace). Let $A$ be an $n \times n$ Hermitian
matrix. Then for any $1 \le k \le n$, one has

$$\lambda_1(A) + \dots + \lambda_k(A) = \sup_{\dim(V) = k} \operatorname{tr}(A \downharpoonright_V)$$

and

$$\lambda_{n-k+1}(A) + \dots + \lambda_n(A) = \inf_{\dim(V) = k} \operatorname{tr}(A \downharpoonright_V).$$

As a corollary, we see that $A \mapsto \lambda_1(A) + \dots + \lambda_k(A)$ is a convex
function, and $A \mapsto \lambda_{n-k+1}(A) + \dots + \lambda_n(A)$ is a concave function.

Proof. Again, by symmetry it suffices to prove the first formula.
As before, we may assume without loss of generality that $A$ has the
standard eigenbasis $e_1, \dots, e_n$ corresponding to $\lambda_1(A), \dots, \lambda_n(A)$
respectively. By selecting $V$ to be the span of $e_1, \dots, e_k$ we have the
inequality

$$\lambda_1(A) + \dots + \lambda_k(A) \le \sup_{\dim(V) = k} \operatorname{tr}(A \downharpoonright_V)$$

so it suffices to prove the reverse inequality. For this we induct on the
dimension $n$. If $V$ has dimension $k$, then it has a $k-1$-dimensional
subspace $V'$ that is contained in the span of $e_2, \dots, e_n$. By the
induction hypothesis applied to the restriction of $A$ to this span (which
has eigenvalues $\lambda_2(A), \dots, \lambda_n(A)$), we have

$$\lambda_2(A) + \dots + \lambda_k(A) \ge \operatorname{tr}(A \downharpoonright_{V'}).$$

On the other hand, if $v$ is a unit vector in the orthogonal complement
of $V'$ in $V$, we see from (1.53) that

$$\lambda_1(A) \ge v^* A v.$$

Adding the two inequalities we obtain the claim. □

Specialising Proposition 1.3.4 to the case when $V$ is a coordinate
subspace (i.e. the span of $k$ of the basis vectors $e_1, \dots, e_n$), we
conclude the Schur-Horn inequalities

$$\lambda_{n-k+1}(A) + \dots + \lambda_n(A) \le a_{i_1 i_1} + \dots + a_{i_k i_k} \le \lambda_1(A) + \dots + \lambda_k(A) \tag{1.60}$$

for any $1 \le i_1 < \dots < i_k \le n$, where $a_{11}, a_{22}, \dots, a_{nn}$ are the diagonal
entries of $A$.
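The Schur-Horn inequalities (1.60) are easy to test exhaustively on a small example. The following sketch (the matrix, seed, and the choices $n = 5$, $k = 2$ are assumptions for illustration only) compares every sum of $k$ diagonal entries against the sums of the $k$ smallest and $k$ largest eigenvalues.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, k = 5, 2
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (M + M.conj().T) / 2

lam = np.linalg.eigvalsh(A)[::-1]      # lambda_1 >= ... >= lambda_n
diag = np.real(np.diag(A))             # diagonal of a Hermitian matrix is real

upper = lam[:k].sum()                  # lambda_1 + ... + lambda_k
lower = lam[n - k:].sum()              # lambda_{n-k+1} + ... + lambda_n

# Sum of diagonal entries over every choice of indices i_1 < ... < i_k.
diag_sums = [diag[list(idx)].sum() for idx in combinations(range(n), k)]
schur_horn_holds = all(lower - 1e-12 <= s <= upper + 1e-12 for s in diag_sums)
```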
Exercise 1.3.2. Show that the inequalities (1.60) are equivalent to
the assertion that the diagonal entries $\operatorname{diag}(A) = (a_{11}, a_{22}, \dots, a_{nn})$
lie in the permutahedron of $\lambda_1(A), \dots, \lambda_n(A)$, defined as the convex
hull of the $n!$ permutations of $(\lambda_1(A), \dots, \lambda_n(A))$ in $\mathbf{R}^n$.

Remark 1.3.5. It is a theorem of Schur and Horn [Ho1954] that
these are the complete set of inequalities connecting the diagonal
entries $\operatorname{diag}(A) = (a_{11}, a_{22}, \dots, a_{nn})$ of a Hermitian matrix to its
spectrum. To put it another way, the image of any coadjoint orbit
$\mathcal{O}_A := \{U A U^* : U \in U(n)\}$ of a matrix $A$ with a given spectrum
$\lambda_1, \dots, \lambda_n$ under the diagonal map $\operatorname{diag} : A \mapsto \operatorname{diag}(A)$ is the
permutahedron of $\lambda_1, \dots, \lambda_n$. Note that the vertices of this permutahedron
can be attained by considering the diagonal matrices inside this
coadjoint orbit, whose entries are then a permutation of the eigenvalues.
One can interpret this diagonal map diag as the moment map
associated with the conjugation action of the standard maximal torus
of $U(n)$ (i.e. the diagonal unitary matrices) on the coadjoint orbit.
When viewed in this fashion, the Schur-Horn theorem can be
viewed as the special case of the more general Atiyah convexity theorem
[At1982] (also proven independently by Guillemin and Sternberg
[GuSt1982]) in symplectic geometry. Indeed, the topic of eigenvalues
of Hermitian matrices turns out to be quite profitably viewed as a
question in symplectic geometry (and also in algebraic geometry,
particularly when viewed through the machinery of geometric invariant
theory).

There is a simultaneous generalisation of Theorem 1.3.2 and Propo- 
sition 1.3.4: 

Exercise 1.3.3 (Wielandt minimax formula). Let $1 \le i_1 < \dots < i_k \le n$
be integers. Define a partial flag to be a nested collection
$V_1 \subset \dots \subset V_k$ of subspaces of $\mathbf{C}^n$ such that $\dim(V_j) = i_j$ for all
$1 \le j \le k$. Define the associated Schubert variety $X(V_1, \dots, V_k)$ to
be the collection of all $k$-dimensional subspaces $W$ such that
$\dim(W \cap V_j) \ge j$. Show that for any $n \times n$ Hermitian matrix $A$,

$$\lambda_{i_1}(A) + \dots + \lambda_{i_k}(A) = \sup_{V_1, \dots, V_k} \ \inf_{W \in X(V_1, \dots, V_k)} \operatorname{tr}(A \downharpoonright_W).$$
1.3.3. Eigenvalue inequalities. Using the above minimax formulae,
we can now quickly prove a variety of eigenvalue inequalities. The
basic idea is to exploit the linearity relationship

$$v^* (A + B) v = v^* A v + v^* B v \tag{1.61}$$

for any unit vector $v$, and more generally

$$\operatorname{tr}((A + B) \downharpoonright_V) = \operatorname{tr}(A \downharpoonright_V) + \operatorname{tr}(B \downharpoonright_V) \tag{1.62}$$

for any subspace $V$.

For instance, as mentioned before, the inequality (1.54) follows
immediately from (1.53) and (1.61). Similarly, for the Ky Fan
inequality (1.56), one observes from (1.62) and Proposition 1.3.4 that

$$\operatorname{tr}((A + B) \downharpoonright_W) \le \operatorname{tr}(A \downharpoonright_W) + \lambda_1(B) + \dots + \lambda_k(B)$$

for any $k$-dimensional subspace $W$. Substituting this into Proposition
1.3.4 gives the claim. If one uses Exercise 1.3.3 instead of Proposition
1.3.4, one obtains the more general Lidskii inequality

$$\lambda_{i_1}(A + B) + \dots + \lambda_{i_k}(A + B) \le \lambda_{i_1}(A) + \dots + \lambda_{i_k}(A) + \lambda_1(B) + \dots + \lambda_k(B) \tag{1.63}$$

for any $1 \le i_1 < \dots < i_k \le n$.

In a similar spirit, using the inequality

$$|v^* B v| \le \|B\|_{op} = \max(|\lambda_1(B)|, |\lambda_n(B)|)$$

for unit vectors $v$, combined with (1.61) and (1.57), we obtain the
eigenvalue stability inequality

$$|\lambda_i(A + B) - \lambda_i(A)| \le \|B\|_{op}, \tag{1.64}$$

thus the spectrum of $A + B$ is close to that of $A$ if $B$ is small in
operator norm. In particular, we see that the map $A \mapsto \lambda_i(A)$ is
Lipschitz continuous on the space of Hermitian matrices, for fixed
$1 \le i \le n$.
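The stability inequality (1.64) can be observed directly on a small perturbation. In the sketch below, the matrices, the seed, the perturbation size $0.1$ and the helper `random_hermitian` are all choices made for this illustration rather than anything prescribed by the text.

```python
import numpy as np

def random_hermitian(n, rng):
    """A random n x n Hermitian matrix (illustrative helper)."""
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2

rng = np.random.default_rng(3)
n = 8
A = random_hermitian(n, rng)
B = 0.1 * random_hermitian(n, rng)     # a small Hermitian perturbation

lam_A = np.linalg.eigvalsh(A)[::-1]
lam_AB = np.linalg.eigvalsh(A + B)[::-1]
lam_B = np.linalg.eigvalsh(B)[::-1]

op_norm_B = np.linalg.norm(B, 2)       # operator norm = largest singular value
max_shift = np.max(np.abs(lam_AB - lam_A))
```

Here `max_shift` is the largest displacement of any ordered eigenvalue, which (1.64) bounds by `op_norm_B`.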

More generally, suppose one wants to establish the Weyl inequality
(1.55). From (1.57) we see that it suffices to show that every
$i + j - 1$-dimensional subspace $V$ contains a unit vector $v$ such that

$$v^* (A + B) v \le \lambda_i(A) + \lambda_j(B).$$

But from (1.58), one can find a subspace $U$ of codimension $i - 1$ such
that $v^* A v \le \lambda_i(A)$ for all unit vectors $v$ in $U$, and a subspace $W$ of
codimension $j - 1$ such that $v^* B v \le \lambda_j(B)$ for all unit vectors $v$ in
$W$. The intersection $U \cap W$ has codimension at most $i + j - 2$ and
so has a nontrivial intersection with $V$; and the claim follows.
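As a sanity check on the argument above, one can test the Weyl inequality in the form $\lambda_{i+j-1}(A+B) \le \lambda_i(A) + \lambda_j(B)$ (the form that the dimension count above establishes) over all admissible pairs $(i, j)$. The random matrices, seed and helper name are assumptions made for this sketch.

```python
import numpy as np

def random_hermitian(n, rng):
    """A random n x n Hermitian matrix (illustrative helper)."""
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2

rng = np.random.default_rng(4)
n = 6
A = random_hermitian(n, rng)
B = random_hermitian(n, rng)

lam_A = np.linalg.eigvalsh(A)[::-1]        # decreasing order, as in the text
lam_B = np.linalg.eigvalsh(B)[::-1]
lam_sum = np.linalg.eigvalsh(A + B)[::-1]

violations = []
for i in range(1, n + 1):
    for j in range(1, n + 1):
        if i + j - 1 <= n:
            # 1-based indices in the text become 0-based array indices here.
            if lam_sum[i + j - 2] > lam_A[i - 1] + lam_B[j - 1] + 1e-12:
                violations.append((i, j))
```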

Remark 1.3.6. More generally, one can generate an eigenvalue
inequality whenever the intersection numbers of three Schubert varieties
of compatible dimensions is non-zero; see [HeRo1995]. In fact, this
generates a complete set of inequalities; see [Klyachko]. One can
in fact restrict attention to those varieties whose intersection number
is exactly one; see [KnTaWo2004]. Finally, in those cases, the
fact that the intersection is one can be proven by entirely elementary
means (based on the standard inequalities relating the dimension of
two subspaces $V$, $W$ to their intersection $V \cap W$ and sum $V + W$); see
[BeCoDyLiTi2010]. As a consequence, the methods in this section
can, in principle, be used to derive all possible eigenvalue inequalities
for sums of Hermitian matrices.

Exercise 1.3.4. Verify the inequalities (1.63) and (1.55) by hand
in the case when $A$ and $B$ commute (and are thus simultaneously
diagonalisable), without the use of minimax formulae.

Exercise 1.3.5. Establish the dual Lidskii inequality

$$\lambda_{i_1}(A + B) + \dots + \lambda_{i_k}(A + B) \ge \lambda_{i_1}(A) + \dots + \lambda_{i_k}(A) + \lambda_{n-k+1}(B) + \dots + \lambda_n(B)$$

for any $1 \le i_1 < \dots < i_k \le n$ and the dual Weyl inequality

$$\lambda_{i+j-n}(A + B) \ge \lambda_i(A) + \lambda_j(B)$$

whenever $1 \le i, j, i + j - n \le n$.

Exercise 1.3.6. Use the Lidskii inequality to establish the more
general inequality

$$\sum_{i=1}^n c_i \lambda_i(A + B) \le \sum_{i=1}^n c_i \lambda_i(A) + \sum_{i=1}^n c_i^* \lambda_i(B)$$

whenever $c_1, \dots, c_n \ge 0$, and $c_1^* \ge \dots \ge c_n^* \ge 0$ is the decreasing
rearrangement of $c_1, \dots, c_n$. (Hint: express $c_i$ as the integral of the
indicator $1(c_i > \lambda)$ as $\lambda$ runs from $0$ to infinity. For each fixed $\lambda$, apply (1.63).)
Combine this with Hölder's inequality to conclude the $p$-Wielandt-Hoffman
inequality

$$\|(\lambda_i(A + B) - \lambda_i(A))_{i=1}^n\|_{\ell^p_n} \le \|B\|_{S^p} \tag{1.65}$$

for any $1 \le p \le \infty$, where

$$\|(a_i)_{i=1}^n\|_{\ell^p_n} := \Big( \sum_{i=1}^n |a_i|^p \Big)^{1/p}$$

is the usual $\ell^p$ norm (with the usual convention that
$\|(a_i)_{i=1}^n\|_{\ell^\infty_n} := \sup_{1 \le i \le n} |a_i|$), and

$$\|B\|_{S^p} := \|(\lambda_i(B))_{i=1}^n\|_{\ell^p_n} \tag{1.66}$$

is the $p$-Schatten norm of $B$.

Exercise 1.3.7. Show that the $p$-Schatten norms are indeed norms
on the space of Hermitian matrices for every $1 \le p \le \infty$.

Exercise 1.3.8. Show that for any $1 \le p \le \infty$ and any Hermitian
matrix $A = (a_{ij})_{1 \le i,j \le n}$, one has

$$\|(a_{ii})_{i=1}^n\|_{\ell^p_n} \le \|A\|_{S^p}. \tag{1.67}$$

Exercise 1.3.9. Establish the non-commutative Hölder inequality

$$|\operatorname{tr}(AB)| \le \|A\|_{S^p} \|B\|_{S^{p'}}$$

whenever $1 \le p, p' \le \infty$ with $1/p + 1/p' = 1$, and $A$, $B$ are $n \times n$
Hermitian matrices. (Hint: Diagonalise one of the matrices and use
the preceding exercise.)

The most important$^{13}$ $p$-Schatten norms are the $\infty$-Schatten norm
$\|A\|_{S^\infty} = \|A\|_{op}$, which is just the operator norm, and the 2-Schatten
norm $\|A\|_{S^2} = (\sum_{i=1}^n \lambda_i(A)^2)^{1/2}$, which is also the Frobenius norm
(or Hilbert-Schmidt norm)

$$\|A\|_{S^2} = \|A\|_F := \operatorname{tr}(A A^*)^{1/2} = \Big( \sum_{i=1}^n \sum_{j=1}^n |a_{ij}|^2 \Big)^{1/2}$$

where $a_{ij}$ are the coefficients of $A$. Thus we see that the $p = 2$ case
of the Wielandt-Hoffman inequality can be written as

$$\sum_{i=1}^n |\lambda_i(A + B) - \lambda_i(A)|^2 \le \|B\|_F^2. \tag{1.68}$$

We will give an alternate proof of this inequality, based on eigenvalue
deformation, in the next section.

$^{13}$ The 1-Schatten norm $S^1$, also known as the nuclear norm or trace class norm,
is important in a number of applications, such as matrix completion, but will not be
used in this text.
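The $p = 2$ case of the Wielandt-Hoffman inequality, together with the identification of the $2$-Schatten and $\infty$-Schatten norms with the Frobenius and operator norms, can be checked numerically. In this sketch the helper names `random_hermitian` and `schatten_norm`, the seed and the dimension are illustrative assumptions.

```python
import numpy as np

def random_hermitian(n, rng):
    """A random n x n Hermitian matrix (illustrative helper)."""
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2

def schatten_norm(H, p):
    """p-Schatten norm of a Hermitian matrix H: the l^p norm of its eigenvalues."""
    return np.linalg.norm(np.linalg.eigvalsh(H), ord=p)

rng = np.random.default_rng(5)
n = 7
A = random_hermitian(n, rng)
B = random_hermitian(n, rng)

# eigvalsh sorts both spectra the same way, so entries are paired as
# lambda_i(A + B) - lambda_i(A).
shift = np.linalg.eigvalsh(A + B) - np.linalg.eigvalsh(A)
lhs = np.sum(shift**2)                     # sum of squared eigenvalue shifts
rhs = np.linalg.norm(B, 'fro')**2          # squared Frobenius norm of B
```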

1.3.4. Eigenvalue deformation. From the Weyl inequality (1.64),
we know that the eigenvalue maps $A \mapsto \lambda_i(A)$ are Lipschitz continuous
on Hermitian matrices (and thus also on real symmetric matrices).
It turns out that we can obtain better regularity, provided that
we avoid repeated eigenvalues. Fortunately, repeated eigenvalues are
rare:

Exercise 1.3.10 (Dimension count). Suppose that $n \ge 2$. Show that
the space of Hermitian matrices with at least one repeated eigenvalue
has codimension 3 in the space of all Hermitian matrices, and the
space of real symmetric matrices with at least one repeated eigenvalue
has codimension 2 in the space of all real symmetric matrices. (When
$n = 1$, repeated eigenvalues of course do not occur.)

Let us say that a Hermitian matrix has simple spectrum if it has
no repeated eigenvalues. We thus see from the above exercise and
(1.64) that the set of Hermitian matrices with simple spectrum forms
an open dense set in the space of all Hermitian matrices, and similarly
for real symmetric matrices; thus simple spectrum is the generic
behaviour of such matrices. Indeed, the unexpectedly high codimension
of the non-simple matrices (naively, one would expect a codimension
1 set for a collision between, say, $\lambda_i(A)$ and $\lambda_{i+1}(A)$) suggests a
repulsion phenomenon: because it is unexpectedly rare for eigenvalues
to be equal, there must be some "force" that "repels" eigenvalues
of Hermitian (and to a lesser extent, real symmetric) matrices from
getting too close to each other. We now develop some machinery to
make this more precise.

We first observe that when $A$ has simple spectrum, the zeroes of
the characteristic polynomial $\lambda \mapsto \det(A - \lambda I)$ are simple (i.e. the
polynomial has nonzero derivative at those zeroes). From this and
the inverse function theorem, we see that each of the eigenvalue maps
$A \mapsto \lambda_i(A)$ are smooth on the region where $A$ has simple spectrum.
Because the eigenvectors $u_i(A)$ are determined (up to phase) by the
equations $(A - \lambda_i(A) I) u_i(A) = 0$ and $u_i(A)^* u_i(A) = 1$, another
application of the inverse function theorem tells us that we can (locally$^{14}$)
select the maps $A \mapsto u_i(A)$ to also be smooth.

Now suppose that $A = A(t)$ depends smoothly on a time variable
$t$, so that (when $A$ has simple spectrum) the eigenvalues $\lambda_i(t) =
\lambda_i(A(t))$ and eigenvectors $u_i(t) = u_i(A(t))$ also depend smoothly on
$t$. We can then differentiate the equations

$$A u_i = \lambda_i u_i \tag{1.69}$$

and

$$u_i^* u_i = 1 \tag{1.70}$$

to obtain various equations of motion for $\lambda_i$ and $u_i$ in terms of the
derivatives of $A$.

Let's see how this works. Taking first derivatives of (1.69), (1.70)
using the product rule, we obtain

$$\dot{A} u_i + A \dot{u}_i = \dot{\lambda}_i u_i + \lambda_i \dot{u}_i \tag{1.71}$$

and

$$\dot{u}_i^* u_i + u_i^* \dot{u}_i = 0. \tag{1.72}$$

The equation (1.72) simplifies to $\dot{u}_i^* u_i = 0$, thus $\dot{u}_i$ is orthogonal to $u_i$.
Taking inner products of (1.71) with $u_i$, we conclude the Hadamard
first variation formula

$$\dot{\lambda}_i = u_i^* \dot{A} u_i. \tag{1.73}$$

This can already be used to give alternate proofs of various
eigenvalue identities. For instance, if we apply this to $A(t) := A + tB$, we
see that

$$\frac{d}{dt} \lambda_i(A + tB) = u_i(A + tB)^* B u_i(A + tB)$$

whenever $A + tB$ has simple spectrum. The right-hand side can be
bounded in magnitude by $\|B\|_{op}$, and so we see that the map $t \mapsto
\lambda_i(A + tB)$ is Lipschitz continuous, with Lipschitz constant $\|B\|_{op}$
whenever $A + tB$ has simple spectrum, which happens for generic
$A$, $B$ (and all $t$) by Exercise 1.3.10. By the fundamental theorem of
calculus, we thus conclude (1.64).

$^{14}$ There may be topological obstructions to smoothly selecting these vectors
globally, but this will not concern us here as we will be performing a local analysis
only. In some applications, it is more convenient not to work with the $u_i(A)$
at all due to their phase ambiguity, and work instead with the spectral projections
$P_i(A) := u_i(A) u_i(A)^*$, which do not have this ambiguity.
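The Hadamard first variation formula (1.73) lends itself to a finite-difference check along the line $A(t) = A + tB$. The sketch below compares $u_i^* B u_i$ with a central difference quotient at $t = 0$; the random matrices (which have simple spectrum with probability one), the seed and the step size $h$ are assumptions of this illustration.

```python
import numpy as np

def random_hermitian(n, rng):
    """A random n x n Hermitian matrix (illustrative helper)."""
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2

rng = np.random.default_rng(6)
n = 5
A = random_hermitian(n, rng)
B = random_hermitian(n, rng)

lam, U = np.linalg.eigh(A)             # columns of U are the eigenvectors u_i

# Predicted derivative of lambda_i(A + tB) at t = 0, namely u_i^* B u_i.
predicted = np.real(np.einsum('ji,jk,ki->i', U.conj(), B, U))

# Central finite difference in t; eigvalsh keeps a consistent ordering for
# small h as long as the eigenvalues of A are well separated.
h = 1e-6
finite_diff = (np.linalg.eigvalsh(A + h * B) - np.linalg.eigvalsh(A - h * B)) / (2 * h)

op_norm_B = np.linalg.norm(B, 2)
```

The phase ambiguity of the eigenvectors returned by `eigh` is harmless here, since $u_i^* B u_i$ is unchanged when $u_i$ is multiplied by a complex phase.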

Exercise 1.3.11. Use a similar argument to the one above to estab- 
lish (1.68) without using minimax formulae or Lidskii's inequality. 

Exercise 1.3.12. Use a similar argument to the one above to deduce
Lidskii's inequality (1.63) from Proposition 1.3.4 rather than Exercise
1.3.3.

One can also compute the second derivative of eigenvalues:

Exercise 1.3.13. Suppose that $A = A(t)$ depends smoothly on $t$.
By differentiating (1.71) and (1.72), establish the Hadamard second
variation formula$^{15}$

$$\frac{d^2}{dt^2} \lambda_k = u_k^* \ddot{A} u_k + 2 \sum_{j \ne k} \frac{|u_j^* \dot{A} u_k|^2}{\lambda_k - \lambda_j} \tag{1.74}$$

whenever $A$ has simple spectrum and $1 \le k \le n$.

$^{15}$ If one interprets the second derivative of the eigenvalues as being proportional
to a "force" on those eigenvalues (in analogy with Newton's second law), (1.74) is
asserting that each eigenvalue $\lambda_j$ "repels" the other eigenvalues $\lambda_k$ by exerting a force
that is inversely proportional to their separation (and also proportional to the square of
the matrix coefficient of $\dot{A}$ in the eigenbasis). See [Ta2009b, §1.5] for more discussion.

Remark 1.3.7. In the proof of the four moment theorem [TaVu2009b]
on the fine spacing of Wigner matrices, one also needs the variation
formulae for the third, fourth, and fifth derivatives of the eigenvalues
(the first four derivatives match up with the four moments mentioned
in the theorem, and the fifth derivative is needed to control error
terms). Fortunately, one does not need the precise formulae for these
derivatives (which, as one can imagine, are quite complicated), but
only their general form, and in particular an upper bound for these
derivatives in terms of more easily computable quantities.
1.3.5. Minors. In the previous sections, we perturbed $n \times n$ Hermitian
matrices $A = A_n$ by adding a (small) $n \times n$ Hermitian correction
matrix $B$ to them to form a new $n \times n$ Hermitian matrix $A + B$.
Another important way to perturb a matrix is to pass to a principal
minor, for instance to the top left $n-1 \times n-1$ minor $A_{n-1}$ of $A_n$.
There is an important relationship between the eigenvalues of the two
matrices:

Exercise 1.3.14 (Cauchy interlacing law). Show that for any $n \times n$ Hermitian
matrix $A_n$ with top left $n-1 \times n-1$ minor $A_{n-1}$, one has

$$\lambda_{i+1}(A_n) \le \lambda_i(A_{n-1}) \le \lambda_i(A_n) \tag{1.75}$$

for all $1 \le i < n$. (Hint: use the Courant-Fischer min-max theorem,
Theorem 1.3.2.) Show furthermore that the space of $A_n$ for which
equality holds in one of the inequalities in (1.75) has codimension 2
(for Hermitian matrices) or 1 (for real symmetric matrices).

Remark 1.3.8. If one takes successive minors $A_{n-1}, A_{n-2}, \dots, A_1$
of an $n \times n$ Hermitian matrix $A_n$, and computes their spectra, then
(1.75) shows that this triangular array of numbers forms a pattern
known as a Gelfand-Tsetlin pattern.
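To make the interlacing pattern concrete, the following sketch builds the triangular array of spectra of the top left $k \times k$ minors of a random Hermitian matrix and checks (1.75) between consecutive rows. The matrix, seed and the helper name `interlaces` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 6
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (M + M.conj().T) / 2

# Row k-1 of the pattern holds the spectrum of the top left k x k minor,
# in decreasing order.
pattern = [np.linalg.eigvalsh(A[:k, :k])[::-1] for k in range(1, n + 1)]

def interlaces(big, small, tol=1e-12):
    """Check lambda_{i+1}(big) <= lambda_i(small) <= lambda_i(big) for all i."""
    return bool(np.all(big[1:] <= small + tol) and np.all(small <= big[:-1] + tol))

rows_interlace = all(interlaces(pattern[k], pattern[k - 1]) for k in range(1, n))
```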

One can obtain a more precise formula for the eigenvalues of $A_n$
in terms of those for $A_{n-1}$:

Exercise 1.3.15 (Eigenvalue equation). Let $A_n$ be an $n \times n$ Hermitian
matrix with top left $n-1 \times n-1$ minor $A_{n-1}$. Suppose that $\lambda$
is an eigenvalue of $A_n$ distinct from all the eigenvalues of $A_{n-1}$ (and
thus simple, by (1.75)). Show that

$$\sum_{j=1}^{n-1} \frac{|u_j(A_{n-1})^* X|^2}{\lambda_j(A_{n-1}) - \lambda} = a_{nn} - \lambda \tag{1.76}$$

where $a_{nn}$ is the bottom right entry of $A_n$, and $X = (a_{jn})_{j=1}^{n-1} \in \mathbf{C}^{n-1}$
is the right column of $A_n$ (minus the bottom entry). (Hint: Expand out
the eigenvalue equation $A_n u = \lambda u$ into the $\mathbf{C}^{n-1}$ and $\mathbf{C}$ components.)
Note the similarities between (1.76) and (1.74).

Observe that the function $\lambda \mapsto \sum_{j=1}^{n-1} \frac{|u_j(A_{n-1})^* X|^2}{\lambda_j(A_{n-1}) - \lambda}$ is a rational
function of $\lambda$ which is increasing away from the eigenvalues of $A_{n-1}$,
where it has a pole (except in the rare case when the inner product
$u_j(A_{n-1})^* X$ vanishes, in which case it can have a removable
singularity). By graphing this function one can see that the interlacing
formula (1.75) can also be interpreted as a manifestation of the
intermediate value theorem.

The identity (1.76) suggests that under typical circumstances,
an eigenvalue $\lambda$ of $A_n$ can only get close to an eigenvalue $\lambda_j(A_{n-1})$
if the associated inner product $u_j(A_{n-1})^* X$ is small. This type of
observation is useful to achieve eigenvalue repulsion - to show that
it is unlikely that the gap between two adjacent eigenvalues is small.
We shall see examples of this in later sections.
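The eigenvalue equation (1.76) can be verified numerically for a generic matrix, for which no eigenvalue of $A_n$ coincides with an eigenvalue of $A_{n-1}$. In this sketch $X$ is taken to be the right column of $A_n$ without its bottom entry; the matrix, seed and tolerance are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (M + M.conj().T) / 2

A_minor = A[:-1, :-1]                  # top left (n-1) x (n-1) minor
X = A[:-1, -1]                         # right column, minus the bottom entry
a_nn = A[-1, -1].real                  # bottom right entry (real for Hermitian A)

mu, V = np.linalg.eigh(A_minor)        # eigenvalues and eigenvectors of the minor
weights = np.abs(V.conj().T @ X) ** 2  # |u_j(A_{n-1})^* X|^2 for each j

lam = np.linalg.eigvalsh(A)            # eigenvalues of the full matrix
lhs = np.array([np.sum(weights / (mu - l)) for l in lam])
rhs = a_nn - lam
```

Both sides are evaluated at every eigenvalue $\lambda$ of $A_n$ at once; the ordering convention does not matter, since each eigenvalue is paired with itself.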

1.3.6. Singular values. The theory of eigenvalues of $n \times n$ Hermitian
matrices has an analogue in the theory of singular values of $p \times n$
non-Hermitian matrices. We first begin with the counterpart to the
spectral theorem, namely the singular value decomposition.

Theorem 1.3.9 (Singular value decomposition). Let $0 \le p \le n$,
and let $A$ be a linear transformation from an $n$-dimensional complex
Hilbert space $U$ to a $p$-dimensional complex Hilbert space $V$. (In
particular, $A$ could be a $p \times n$ matrix with complex entries, viewed as a
linear transformation from $\mathbf{C}^n$ to $\mathbf{C}^p$.) Then there exist non-negative
real numbers

$$\sigma_1(A) \ge \dots \ge \sigma_p(A) \ge 0$$

(known as the singular values of $A$) and orthonormal sets $u_1(A), \dots, u_p(A) \in
U$ and $v_1(A), \dots, v_p(A) \in V$ (known as singular vectors of $A$), such
that

$$A u_j = \sigma_j v_j; \quad A^* v_j = \sigma_j u_j$$

for all $1 \le j \le p$, where we abbreviate $u_j = u_j(A)$, etc.

Furthermore, $A u = 0$ whenever $u$ is orthogonal to all of the
$u_1(A), \dots, u_p(A)$.

We adopt the convention that $\sigma_i(A) = 0$ for $i > p$. The above
theorem only applies to matrices with at least as many columns as rows,
but one can also extend the definition to matrices with more rows
than columns by adopting the convention $\sigma_i(A^*) := \sigma_i(A)$ (it is easy to
check that this extension is consistent on square matrices). All of
the results below extend (with minor modifications) to the case when
there are more rows than columns, but we have not displayed those
extensions here in order to simplify the notation.

Proof. We induct on $p$. The claim is vacuous for $p = 0$, so suppose
that $p \ge 1$ and that the claim has already been proven for $p - 1$.

We follow a similar strategy to the proof of Theorem 1.3.1. We
may assume that $A$ is not identically zero, as the claim is obvious
otherwise. The function $u \mapsto \|Au\|^2$ is continuous on the unit sphere
of $U$, so there exists a unit vector $u_1$ which maximises this quantity. If
we set $\sigma_1 := \|A u_1\| > 0$, one easily verifies that $u_1$ is a critical point of
the map $u \mapsto \|Au\|^2 - \sigma_1^2 \|u\|^2$, which then implies that $A^* A u_1 = \sigma_1^2 u_1$.
Thus, if we set $v_1 := A u_1 / \sigma_1$, then $A u_1 = \sigma_1 v_1$ and $A^* v_1 = \sigma_1 u_1$.
This implies that $A$ maps the orthogonal complement $u_1^\perp$ of $u_1$ in $U$ to
the orthogonal complement $v_1^\perp$ of $v_1$ in $V$. By induction hypothesis,
the restriction of $A$ to $u_1^\perp$ (and $v_1^\perp$) then admits a singular value
decomposition with singular values $\sigma_2 \ge \dots \ge \sigma_p \ge 0$ and singular
vectors $u_2, \dots, u_p \in u_1^\perp$, $v_2, \dots, v_p \in v_1^\perp$ with the stated properties.
By construction we see that $\sigma_2, \dots, \sigma_p$ are less than or equal to $\sigma_1$. If
we now adjoin $\sigma_1, u_1, v_1$ to the other singular values and vectors we
obtain the claim. □

Exercise 1.3.16. Show that the singular values $\sigma_1(A) \ge \dots \ge
\sigma_p(A) \ge 0$ of a $p \times n$ matrix $A$ are unique. If we have $\sigma_1(A) >
\dots > \sigma_p(A) > 0$, show that the singular vectors are unique up to
rotation by a complex phase.

By construction (and the above uniqueness claim) we see that
$\sigma_i(UAV) = \sigma_i(A)$ whenever $A$ is a $p \times n$ matrix, $U$ is a unitary $p \times p$
matrix, and $V$ is a unitary $n \times n$ matrix. Thus the singular spectrum
of a matrix is invariant under left and right unitary transformations.
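The following sketch relates NumPy's `svd` output to the notation of Theorem 1.3.9 and checks the unitary invariance just described. NumPy factors $A$ as `left @ diag(s) @ right_h`, so in the text's notation the $v_j$ are the columns of `left` and the $u_j$ are the conjugated rows of `right_h`; the matrix sizes, seed and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)
p, n = 3, 5
A = rng.standard_normal((p, n)) + 1j * rng.standard_normal((p, n))

# Reduced SVD: left is p x p, s holds sigma_1 >= ... >= sigma_p, right_h is p x n.
left, s, right_h = np.linalg.svd(A, full_matrices=False)
v = left                     # columns v_1, ..., v_p in C^p
u = right_h.conj().T         # columns u_1, ..., u_p in C^n

# Random unitary matrices, obtained from QR factorisations of Gaussian matrices.
P, _ = np.linalg.qr(rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p)))
Q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
s_rotated = np.linalg.svd(P @ A @ Q, compute_uv=False)
```

Multiplying a matrix of column vectors by `s` on the right scales column $j$ by $\sigma_j$, which is how the relations $A u_j = \sigma_j v_j$ and $A^* v_j = \sigma_j u_j$ can be tested for all $j$ at once.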

Exercise 1.3.17. If $A$ is a $p \times n$ complex matrix for some $1 \le p \le n$,
show that the augmented matrix

$$\tilde{A} := \begin{pmatrix} 0 & A \\ A^* & 0 \end{pmatrix}$$

is a $p+n \times p+n$ Hermitian matrix whose eigenvalues consist of
$\pm\sigma_1(A), \dots, \pm\sigma_p(A)$, together with $n - p$ copies of the eigenvalue
zero. (This generalises Exercise 2.3.17.) What is the relationship
between the singular vectors of $A$ and the eigenvectors of $\tilde{A}$?

Exercise 1.3.18. If $A$ is an $n \times n$ Hermitian matrix, show that the
singular values $\sigma_1(A), \dots, \sigma_n(A)$ of $A$ are simply the absolute values
$|\lambda_1(A)|, \dots, |\lambda_n(A)|$ of the eigenvalues of $A$, arranged in descending order. Show that
the same claim also holds when $A$ is a normal matrix (that is, when
$A$ commutes with its adjoint). What is the relationship between the
singular vectors and eigenvectors of $A$?

Remark 1.3.10. When A is not normal, the relationship between 
eigenvalues and singular values is more subtle. We will discuss this 
point in later sections. 

Exercise 1.3.19. If $A$ is a $p \times n$ complex matrix for some $1 \le p \le n$,
show that $A A^*$ has eigenvalues $\sigma_1(A)^2, \dots, \sigma_p(A)^2$, and $A^* A$ has
eigenvalues $\sigma_1(A)^2, \dots, \sigma_p(A)^2$ together with $n - p$ copies of the
eigenvalue zero. Based on this observation, give an alternate proof of the
singular value decomposition theorem using the spectral theorem for
(positive semi-definite) Hermitian matrices.

Exercise 1.3.20. Show that the rank of a p x n matrix is equal to 
the number of non-zero singular values. 

Exercise 1.3.21. Let $A$ be a $p \times n$ complex matrix for some $1 \le p \le n$.
Establish the Courant-Fischer min-max formula

$$\sigma_i(A) = \sup_{\dim(V) = i} \ \inf_{v \in V ; |v| = 1} |Av| \tag{1.77}$$

for all $1 \le i \le p$, where the supremum ranges over all subspaces of
$\mathbf{C}^n$ of dimension $i$.

One can use the above exercises to deduce many inequalities 
about singular values from analogous ones about eigenvalues. We 
give some examples below. 
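The dictionary between singular values and eigenvalues described in Exercises 1.3.17 and 1.3.19 can at least be observed numerically, which may help when attempting those exercises. The sketch below is only a numerical check on one random matrix (with sizes and seed chosen for illustration), not a proof.

```python
import numpy as np

rng = np.random.default_rng(10)
p, n = 3, 5
A = rng.standard_normal((p, n)) + 1j * rng.standard_normal((p, n))
s = np.linalg.svd(A, compute_uv=False)          # sigma_1 >= ... >= sigma_p

# Augmented (p+n) x (p+n) Hermitian matrix with blocks 0, A, A^*, 0.
augmented = np.block([[np.zeros((p, p)), A],
                      [A.conj().T, np.zeros((n, n))]])
eig_augmented = np.linalg.eigvalsh(augmented)   # ascending order
expected = np.sort(np.concatenate([s, -s, np.zeros(n - p)]))

# Eigenvalues of A A^* (p x p) and A^* A (n x n), in decreasing order.
eig_AAstar = np.linalg.eigvalsh(A @ A.conj().T)[::-1]
eig_AstarA = np.linalg.eigvalsh(A.conj().T @ A)[::-1]
```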

Exercise 1.3.22. Let $A$, $B$ be $p \times n$ complex matrices for some $1 \le p \le n$.

(i) Establish the Weyl inequality $\sigma_{i+j-1}(A + B) \le \sigma_i(A) +
\sigma_j(B)$ whenever $1 \le i, j, i + j - 1 \le p$.

(ii) Establish the Lidskii inequality

$$\sigma_{i_1}(A + B) + \dots + \sigma_{i_k}(A + B) \le \sigma_{i_1}(A) + \dots + \sigma_{i_k}(A) + \sigma_1(B) + \dots + \sigma_k(B)$$

whenever $1 \le i_1 < \dots < i_k \le p$.

(iii) Show that for any $1 \le k \le p$, the map $A \mapsto \sigma_1(A) + \dots +
\sigma_k(A)$ defines a norm on the space $\mathbf{C}^{p \times n}$ of complex $p \times n$
matrices (this norm is known as the $k^{\text{th}}$ Ky Fan norm).

(iv) Establish the Weyl inequality $|\sigma_i(A + B) - \sigma_i(A)| \le \|B\|_{op}$
for all $1 \le i \le p$.

(v) More generally, establish the $q$-Wielandt-Hoffman inequality
$\|(\sigma_i(A + B) - \sigma_i(A))_{1 \le i \le p}\|_{\ell^q_p} \le \|B\|_{S^q}$ for any $1 \le q \le \infty$,
where $\|B\|_{S^q} := \|(\sigma_i(B))_{1 \le i \le p}\|_{\ell^q_p}$ is the $q$-Schatten norm of
$B$. (Note that this is consistent with the previous definition
of the Schatten norms.)

(vi) Show that the $q$-Schatten norm is indeed a norm on $\mathbf{C}^{p \times n}$
for any $1 \le q \le \infty$.

(vii) If $A'$ is formed by removing one row from $A$, show that
$\sigma_{i+1}(A) \le \sigma_i(A') \le \sigma_i(A)$ for all $1 \le i < p$.

(viii) If $p < n$ and $A'$ is formed by removing one column from $A$,
show that $\sigma_{i+1}(A) \le \sigma_i(A') \le \sigma_i(A)$ for all $1 \le i < p$ and
$\sigma_p(A') \le \sigma_p(A)$. What changes when $p = n$?

Exercise 1.3.23. Let $A$ be a $p \times n$ complex matrix for some $1 \le p \le n$.
Observe that the linear transformation $A : \mathbf{C}^n \to \mathbf{C}^p$ naturally
induces a linear transformation $A^{\wedge k} : \bigwedge^k \mathbf{C}^n \to \bigwedge^k \mathbf{C}^p$ from $k$-forms
on $\mathbf{C}^n$ to $k$-forms on $\mathbf{C}^p$. We give $\bigwedge^k \mathbf{C}^n$ the structure of a Hilbert
space by declaring the basic forms $e_{i_1} \wedge \dots \wedge e_{i_k}$ for $1 \le i_1 < \dots <
i_k \le n$ to be orthonormal.

For any $1 \le k \le p$, show that the operator norm of $A^{\wedge k}$ is equal
to $\sigma_1(A) \dots \sigma_k(A)$.

Exercise 1.3.24. Let $A$ be a $p \times n$ matrix for some $1 \le p \le n$, let
$B$ be a $r \times p$ matrix, and let $C$ be a $n \times s$ matrix for some $r, s \ge 1$.

Show that $\sigma_i(BA) \le \|B\|_{op} \sigma_i(A)$ and $\sigma_i(AC) \le \sigma_i(A) \|C\|_{op}$ for
any $1 \le i \le p$.
Exercise 1.3.25. Let $A = (a_{ij})_{1 \le i \le p; 1 \le j \le n}$ be a $p \times n$ matrix for
some $1 \le p \le n$, let $i_1, \dots, i_k \in \{1, \dots, p\}$ be distinct, and let
$j_1, \dots, j_k \in \{1, \dots, n\}$ be distinct. Show that

$$a_{i_1 j_1} + \dots + a_{i_k j_k} \le \sigma_1(A) + \dots + \sigma_k(A).$$

Using this, show that if $j_1, \dots, j_p \in \{1, \dots, n\}$ are distinct, then

$$\|(a_{i j_i})_{i=1}^p\|_{\ell^q_p} \le \|A\|_{S^q}$$

for every $1 \le q \le \infty$.

Exercise 1.3.26. Establish the Hölder inequality

$$|\operatorname{tr}(A B^*)| \le \|A\|_{S^q} \|B\|_{S^{q'}}$$

whenever $A$, $B$ are $p \times n$ complex matrices and $1 \le q, q' \le \infty$ are such
that $1/q + 1/q' = 1$.



Chapter 2 



Random matrices 
2.1. Concentration of measure

Suppose we have a large number of scalar random variables $X_1, \dots, X_n$,
which each have bounded size on average (e.g. their mean and variance
could be $O(1)$). What can one then say about their sum $S_n :=
X_1 + \dots + X_n$? If each individual summand $X_i$ varies in an interval
of size $O(1)$, then their sum of course varies in an interval of size
$O(n)$. However, a remarkable phenomenon, known as concentration
of measure, asserts that assuming a sufficient amount of independence
between the component variables $X_1, \dots, X_n$, this sum sharply
concentrates in a much narrower range, typically in an interval of size
$O(\sqrt{n})$. This phenomenon is quantified by a variety of large deviation
inequalities that give upper bounds (often exponential in nature) on
the probability that such a combined random variable deviates
significantly from its mean. The same phenomenon applies not only to
linear expressions such as $S_n = X_1 + \dots + X_n$, but more generally
to nonlinear combinations $F(X_1, \dots, X_n)$ of such variables, provided
that the nonlinear function $F$ is sufficiently regular (in particular,
if it is Lipschitz, either separately in each variable, or jointly in all
variables).

The basic intuition here is that it is difficult for a large number
of independent variables $X_1, \dots, X_n$ to "work together" to
simultaneously pull a sum $X_1 + \dots + X_n$ or a more general combination
$F(X_1, \dots, X_n)$ too far away from its mean. Independence here is the
key; concentration of measure results typically fail if the $X_i$ are too
highly correlated with each other.
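A small simulation makes the role of independence visible. The sketch below compares the sum of $n$ independent random signs, whose fluctuations are of size about $\sqrt{n}$, with the sum of $n$ copies of a single random sign, which always has magnitude $n$. The choice of signed Bernoulli variables, the sample sizes and the seed are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
n, trials = 2_500, 1_000

# Independent case: each row holds n independent signs in {-1, +1}.
signs = 2 * rng.integers(0, 2, size=(trials, n), dtype=np.int8) - 1
S_independent = signs.sum(axis=1)

# Fully correlated case: X_1 = ... = X_n = X for a single random sign X.
X = 2 * rng.integers(0, 2, size=trials) - 1
S_correlated = n * X

spread_independent = S_independent.std()      # comparable to sqrt(n) = 50
spread_correlated = np.abs(S_correlated).max()  # always equal to n
```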

There are many applications of the concentration of measure phe- 
nomenon, but we will focus on a specific application which is useful 
in the random matrix theory topics we will be studying, namely on 
controlling the behaviour of random n-dimensional vectors with inde- 
pendent components, and in particular on the distance between such 
random vectors and a given subspace. 

Once one has a sufficient amount of independence, the concentration
of measure tends to be sub-gaussian in nature; thus the probability
that one is at least $\lambda$ standard deviations from the mean tends
to drop off like $C \exp(-c\lambda^2)$ for some $C, c > 0$. In particular, one
is $O(\log^{1/2} n)$ standard deviations from the mean with high probability,
and $O(\log^{1/2+\varepsilon} n)$ standard deviations from the mean with
overwhelming probability. Indeed, concentration of measure is our
primary tool for ensuring that various events hold with overwhelming
probability (other moment methods can give high probability but
have difficulty ensuring overwhelming probability).
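To see a sub-gaussian tail of the shape $C \exp(-c\lambda^2)$ in a concrete case, one can compare the empirical tail of a sum of independent random signs with the bound $2\exp(-\lambda^2/2)$. This particular bound is the classical Hoeffding inequality for sums of signs, which is a standard fact that has not been stated at this point in the text and is used here only as an assumed benchmark; the sample sizes and seed are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(12)
n, trials = 2_500, 2_000
signs = 2 * rng.integers(0, 2, size=(trials, n), dtype=np.int8) - 1
S = signs.sum(axis=1)          # each S has mean 0 and standard deviation sqrt(n)

tails = {}
for lam in (1.0, 2.0, 3.0):
    # Fraction of samples at least lam standard deviations from the mean.
    empirical = np.mean(np.abs(S) >= lam * np.sqrt(n))
    bound = 2 * np.exp(-lam**2 / 2)     # assumed Hoeffding-type benchmark
    tails[lam] = (empirical, bound)
```

For $\lambda = 1$ the benchmark exceeds $1$ and is vacuous; the rapid decay only becomes visible for larger $\lambda$.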

This is only a brief introduction to the concentration of mea- 
sure phenomenon. A systematic study of this topic can be found in 
[Le2001]. 

2.1.1. Linear combinations, and the moment method. We begin
with the simple setting of studying a sum $S_n := X_1 + \dots + X_n$ of
random variables. As we shall see, these linear sums are particularly
amenable to the moment method, though to use the more powerful
moments, we will require more powerful independence assumptions
(and, naturally, we will need more moments to be finite or bounded).
As such, we will take the opportunity to use this topic (large deviation
inequalities for sums of random variables) to give a tour of the
moment method, which we will return to when we consider the analogous
questions for the bulk spectral distribution of random matrices.

In this section we shall concern ourselves primarily with bounded 
random variables; in the next section we describe the basic truncation 
method that can allow us to extend from the bounded case to the 
unbounded case (assuming suitable decay hypotheses). 

The zeroth moment method gives a crude upper bound when $S_n$
is non-zero,

(2.1)  $\mathbf{P}(S_n \neq 0) \leq \sum_{i=1}^n \mathbf{P}(X_i \neq 0)$

but in most cases this bound is worse than the trivial bound
$\mathbf{P}(S_n \neq 0) \leq 1$. This bound, however, will be useful when
performing the truncation trick, which we will discuss below.

The first moment method is somewhat better, giving the bound

    $\mathbf{E}|S_n| \leq \sum_{i=1}^n \mathbf{E}|X_i|$

which when combined with Markov's inequality (1.14) gives the rather
weak large deviation inequality

(2.2)  $\mathbf{P}(|S_n| \geq \lambda) \leq \frac{1}{\lambda} \sum_{i=1}^n \mathbf{E}|X_i|$.

As weak as this bound is, it is sometimes sharp. For instance, if
the $X_i$ are all equal to a single signed Bernoulli variable
$X \in \{-1, +1\}$, then $S_n = nX$, and so $|S_n| = n$, and so (2.2) is
sharp when $\lambda = n$. The problem here is a complete lack of
independence; the $X_i$ are all simultaneously positive or simultaneously
negative, causing huge fluctuations in the value of $S_n$.

Informally, one can view (2.2) as the assertion that $S_n$ typically
has size $S_n = O(\sum_{i=1}^n |X_i|)$.

The first moment method also shows that

    $\mathbf{E}S_n = \sum_{i=1}^n \mathbf{E}X_i$

and so we can normalise out the means using the identity

    $S_n - \mathbf{E}S_n = \sum_{i=1}^n (X_i - \mathbf{E}X_i)$.

Replacing the $X_i$ by $X_i - \mathbf{E}X_i$ (and $S_n$ by $S_n - \mathbf{E}S_n$)
we may thus assume for simplicity that all the $X_i$ have mean zero.

Now we consider what the second moment method gives us. We
square $S_n$ and take expectations to obtain

    $\mathbf{E}|S_n|^2 = \sum_{i=1}^n \sum_{j=1}^n \mathbf{E} X_i \overline{X_j}$.

If we assume that the $X_i$ are pairwise independent (in addition to
having mean zero), then $\mathbf{E} X_i \overline{X_j}$ vanishes unless
$i = j$, in which case this expectation is equal to $\mathbf{Var}(X_i)$.
We thus have



(2.3)  $\mathbf{Var}(S_n) = \sum_{i=1}^n \mathbf{Var}(X_i)$

which when combined with Chebyshev's inequality (1.26) (and the
mean zero normalisation) yields the large deviation inequality

(2.4)  $\mathbf{P}(|S_n| \geq \lambda) \leq \frac{1}{\lambda^2} \sum_{i=1}^n \mathbf{Var}(X_i)$.

Without the normalisation that the $X_i$ have mean zero, we obtain

(2.5)  $\mathbf{P}(|S_n - \mathbf{E}S_n| \geq \lambda) \leq \frac{1}{\lambda^2} \sum_{i=1}^n \mathbf{Var}(X_i)$.

Informally, this is the assertion that $S_n$ typically has size
$S_n = \mathbf{E}S_n + O((\sum_{i=1}^n \mathbf{Var}(X_i))^{1/2})$, if we have
pairwise independence. Note also that we do not need the full strength
of the pairwise independence assumption; the slightly weaker hypothesis
of being pairwise uncorrelated$^1$ would have sufficed.

The inequality (2.5) is sharp in two ways. Firstly, we cannot
expect any significant concentration in any range narrower than the
standard deviation $O((\sum_{i=1}^n \mathbf{Var}(X_i))^{1/2})$, as this would
likely contradict (2.3). Secondly, the quadratic-type decay in $\lambda$
in (2.5) is sharp given the pairwise independence hypothesis. For
instance, suppose that $n = 2^m - 1$, and that $X_j := (-1)^{a_j \cdot Y}$,
where $Y$ is drawn uniformly at random from the cube $\{0,1\}^m$, and
$a_1, \dots, a_n$ are an enumeration of the non-zero elements of
$\{0,1\}^m$. Then a little Fourier analysis shows that the $X_j$ for
$1 \leq j \leq n$ each have mean zero and variance 1, and are pairwise
independent in $j$; but $S_n$ is equal to $(n+1)\mathbf{I}(Y = 0) - 1$, which
is equal to $n$ with probability $1/(n+1)$; this is despite the standard
deviation of $S_n$ being just $\sqrt{n}$. This shows that (2.5) is
essentially (i.e. up to constants) sharp here when $\lambda = n$.
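
The pairwise independent example above is small enough to check by brute force. The following is a minimal sketch, not part of the original argument: it enumerates all $2^m$ outcomes of $Y$ exactly, and the function names `pairwise_example` and `check` are hypothetical names chosen for this illustration.

```python
from fractions import Fraction
from itertools import product

def pairwise_example(m):
    """Tabulate X_j = (-1)^(a_j . Y) over every outcome Y in {0,1}^m."""
    cube = list(product((0, 1), repeat=m))
    nonzero = [a for a in cube if any(a)]      # a_1, ..., a_n with n = 2^m - 1
    # One row per outcome Y, holding the vector (X_1(Y), ..., X_n(Y)).
    rows = [[(-1) ** sum(ai * yi for ai, yi in zip(a, y)) for a in nonzero]
            for y in cube]
    return nonzero, rows

def check(m):
    """Verify mean zero and pairwise independence; return (n, P(S = n), Var S)."""
    nonzero, rows = pairwise_example(m)
    n, total = len(nonzero), len(rows)
    # Each X_j takes values +-1 with mean zero, hence has variance 1.
    for j in range(n):
        assert sum(r[j] for r in rows) == 0
    # Pairwise independence: each sign pattern of (X_i, X_j) has probability 1/4.
    for i in range(n):
        for j in range(i + 1, n):
            for si, sj in product((-1, 1), repeat=2):
                count = sum(1 for r in rows if r[i] == si and r[j] == sj)
                assert Fraction(count, total) == Fraction(1, 4)
    sums = [sum(r) for r in rows]
    prob_s_equals_n = Fraction(sums.count(n), total)
    variance = Fraction(sum(s * s for s in sums), total)   # the mean of S is zero
    return n, prob_s_equals_n, variance
```

For $m = 3$ this gives $n = 7$, $\mathbf{P}(S_n = n) = 1/8$ and $\mathbf{Var}(S_n) = 7$, while the right-hand side of (2.5) at $\lambda = n$ is $1/n = 1/7$; so the two sides agree up to a constant factor, as claimed.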

Now we turn to higher moments. Let us assume that the $X_i$ are
normalised to have mean zero and variance at most 1, and are also
almost surely bounded in magnitude by some$^2$ $K$: $|X_i| \leq K$. To
simplify the exposition very slightly we will assume that the $X_i$ are
real-valued; the complex-valued case is very analogous (and can also
be deduced from the real-valued case) and is left to the reader.

$^1$ In other words, we only need to assume that
$\mathbf{Cov}(X_i, X_j) := \mathbf{E}(X_i - \mathbf{E}X_i)\overline{(X_j - \mathbf{E}X_j)}$
vanishes for all distinct $i, j$.

$^2$ Note that we must have $K \geq 1$ to be consistent with the unit
variance hypothesis.

Let us also assume that the $X_1, \dots, X_n$ are $k$-wise independent
for some even positive integer $k$. With this assumption, we can now
estimate the $k$th moment

    $\mathbf{E}|S_n|^k = \sum_{1 \leq i_1, \dots, i_k \leq n} \mathbf{E} X_{i_1} \cdots X_{i_k}$.

To compute the expectation of the product, we can use the $k$-wise
independence, but we need to divide into cases (analogous to the
$i \neq j$ and $i = j$ cases in the second moment calculation above)
depending on how various indices are repeated. If one of the $X_{i_j}$
only appears once, then the entire expectation is zero (since $X_{i_j}$
has mean zero), so we may assume that each of the $X_{i_j}$ appears at
least twice. In particular, there are at most $k/2$ distinct $X_j$ which
appear. If exactly $k/2$ such terms appear, then from the unit variance
assumption we see that the expectation has magnitude at most 1; more
generally, if $k/2 - r$ terms appear, then from the unit variance
assumption and the upper bound by $K$ we see that the expectation has
magnitude at most $K^{2r}$. This leads to the upper bound

    $\mathbf{E}|S_n|^k \leq \sum_{r=0}^{k/2} K^{2r} N_r$

where $N_r$ is the number of ways one can select integers
$i_1, \dots, i_k$ in $\{1, \dots, n\}$ such that each $i_j$ appears at least
twice, and such that exactly $k/2 - r$ integers appear.

We are now faced with the purely combinatorial problem of estimating
$N_r$. We will use a somewhat crude bound. There are
$\binom{n}{k/2-r} \leq n^{k/2-r}/(k/2-r)!$ ways to choose $k/2 - r$
integers from $\{1, \dots, n\}$. Each of the integers $i_j$ has to come
from one of these $k/2 - r$ integers, leading to the crude bound

    $N_r \leq \frac{n^{k/2-r}}{(k/2-r)!} (k/2-r)^k$

which after using a crude form $n! \geq n^n e^{-n}$ of Stirling's formula
(see Section 1.2) gives

    $N_r \leq (en)^{k/2-r} (k/2)^{k/2+r}$,

and so

    $\mathbf{E}|S_n|^k \leq (enk/2)^{k/2} \sum_{r=0}^{k/2} \left(\frac{K^2 k}{en}\right)^r$.

If we make the mild assumption

(2.6)  $K^2 \leq n/k$

then from the geometric series formula we conclude that

    $\mathbf{E}|S_n|^k \leq 2(enk/2)^{k/2}$

(say), which leads to the large deviation inequality

(2.7)  $\mathbf{P}(|S_n| \geq \lambda\sqrt{n}) \leq 2\left(\frac{\sqrt{ek/2}}{\lambda}\right)^k$.

This should be compared with (2.2), (2.5). As $k$ increases, the rate of
decay in the $\lambda$ parameter improves, but to compensate for this, the
range that $S_n$ concentrates in grows slowly, to $O(\sqrt{nk})$ rather
than $O(\sqrt{n})$.
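
As a sanity check on the moment bound and on (2.7), one can compare them with exact values in the simplest case, where the $X_i$ are independent uniform signs $\pm 1$ (so that $K = 1$ and (2.6) reads $k \leq n$). This is only an illustrative sketch under that assumption; the helper names below are hypothetical and do not come from the text.

```python
from fractions import Fraction
from math import comb, e

def exact_moment(n, k):
    """E|S_n|^k for S_n a sum of n independent uniform +-1 signs (exact)."""
    # S_n = 2j - n when exactly j of the signs equal +1.
    total = sum(comb(n, j) * abs(2 * j - n) ** k for j in range(n + 1))
    return Fraction(total, 2 ** n)

def moment_bound(n, k):
    """The right-hand side 2 (e n k / 2)^(k/2) of the bound preceding (2.7)."""
    return 2 * (e * n * k / 2) ** (k / 2)

def exact_tail(n, lam):
    """P(|S_n| >= lam sqrt(n)), computed exactly from the binomial law."""
    hits = sum(comb(n, j) for j in range(n + 1)
               if (2 * j - n) ** 2 >= lam * lam * n)
    return Fraction(hits, 2 ** n)

def tail_bound(lam, k):
    """The right-hand side 2 (sqrt(e k / 2) / lam)^k of (2.7)."""
    return 2 * ((e * k / 2) ** 0.5 / lam) ** k
```

For instance, `exact_moment(100, 2)` equals $n = 100$, consistent with (2.3), and for each even $k \leq 10$ the exact moment lies below `moment_bound(100, k)`. The gap is large, which reflects the crudeness of the combinatorial count rather than any defect in the method.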

Remark 2.1.1. Note how it was important here that $k$ was even.
Odd moments, such as $\mathbf{E}S_n^3$, can be estimated, but due to the lack
of the absolute value sign, these moments do not give much usable
control on the distribution of the $S_n$. One could be more careful in
the combinatorial counting than was done here, but the net effect
of such care is only to improve the explicit constants such as
$\sqrt{e/2}$ appearing in the above bounds.

Now suppose that the $X_1, \dots, X_n$ are not just $k$-wise independent
for any fixed $k$, but are in fact jointly independent. Then we can apply
(2.7) for any $k$ obeying (2.6). We can optimise in $k$ by setting
$\sqrt{nk}$ to be a small multiple of $\lambda$, and conclude the
Gaussian-type bound$^3$

(2.8)  $\mathbf{P}(|S_n| \geq \lambda\sqrt{n}) \leq C \exp(-c\lambda^2)$

for some absolute constants $C, c > 0$, provided that
$|\lambda| \leq c\sqrt{n}/\sqrt{K}$ for some small $c$. Thus we see that
while control of each individual moment $\mathbf{E}|S_n|^k$ only gives
polynomial decay in $\lambda$, by using all the moments simultaneously one
can obtain square-exponential decay (i.e. subgaussian type decay).

$^3$ Note that the bound (2.8) is trivial for $|\lambda| \gg \sqrt{n}$, so we
may assume that $\lambda$ is small compared to this quantity.

By using Stirling's formula (see Exercise 1.2.2) one can show that
the quadratic decay in (2.8) cannot be improved; see Exercise 2.1.2
below.

It was a little complicated to manage such large moments
$\mathbf{E}|S_n|^k$. A slicker way to proceed (but one which exploits the
joint independence and commutativity more strongly) is to work instead
with the exponential moments $\mathbf{E}\exp(tS_n)$, which can be viewed as
a sort of generating function for the power moments. A useful lemma in
this regard is

Lemma 2.1.2 (Hoeffding's lemma). Let $X$ be a scalar variable taking
values in an interval $[a, b]$. Then for any $t > 0$,

(2.9)  $\mathbf{E}e^{tX} \leq e^{t\mathbf{E}X} \left(1 + O(t^2 \mathbf{Var}(X) \exp(O(t(b-a))))\right)$.

In particular

(2.10)  $\mathbf{E}e^{tX} \leq e^{t\mathbf{E}X} \exp(O(t^2 (b-a)^2))$.

Proof. It suffices to prove the first inequality, as the second then
follows using the bound $\mathbf{Var}(X) \leq (b-a)^2$ and from various
elementary estimates.

By subtracting the mean from $X, a, b$ we may normalise $\mathbf{E}X = 0$.
By dividing $X, a, b$ by $b - a$ (and multiplying $t$ to balance) we may
assume that $b - a = 1$, which implies that $X = O(1)$. We then have the
Taylor expansion

    $e^{tX} = 1 + tX + O(t^2 X^2 \exp(O(t)))$

which on taking expectations gives

    $\mathbf{E}e^{tX} = 1 + O(t^2 \mathbf{Var}(X) \exp(O(t)))$

and the claim follows. □

Exercise 2.1.1. Show that the $O(t^2 (b-a)^2)$ factor in (2.10) can be
replaced with $t^2 (b-a)^2/8$, and that this is sharp. (Hint: use Jensen's
inequality, Exercise 1.1.8.)
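
The constant in Exercise 2.1.1 can be probed numerically on two-point distributions, for which the exponential moment is explicit. The sketch below is a sanity check rather than a proof; the function names and the `constant` parameter are hypothetical, introduced only for this illustration.

```python
from math import exp

def mgf_two_point(a, b, p, t):
    """E exp(tX) for X = b with probability p and X = a with probability 1 - p."""
    return (1 - p) * exp(t * a) + p * exp(t * b)

def hoeffding_rhs(a, b, p, t, constant=1 / 8):
    """exp(t EX + constant * t^2 (b - a)^2); constant = 1/8 is Exercise 2.1.1."""
    mean = (1 - p) * a + p * b
    return exp(t * mean + constant * t * t * (b - a) ** 2)
```

With the default constant $1/8$ the bound holds at every grid point one tries; taking $a = -1$, $b = 1$, $p = 1/2$ (so that $\mathbf{E}e^{tX} = \cosh t$) and a smaller constant such as $1/9$, the bound already fails at $t = 0.1$, which is consistent with the sharpness claim.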

We now have the fundamental Chernoff bound:

Theorem 2.1.3 (Chernoff inequality). Let $X_1, \dots, X_n$ be independent
scalar random variables with $|X_i| \leq K$ almost surely, with mean
$\mu_i$ and variance $\sigma_i^2$. Then for any $\lambda > 0$, one has

(2.11)  $\mathbf{P}(|S_n - \mu| \geq \lambda\sigma) \leq C \max(\exp(-c\lambda^2), \exp(-c\lambda\sigma/K))$

for some absolute constants $C, c > 0$, where $\mu := \sum_{i=1}^n \mu_i$ and
$\sigma^2 := \sum_{i=1}^n \sigma_i^2$.

Proof. By taking real and imaginary parts we may assume that the
$X_i$ are real. By subtracting off the mean (and adjusting $K$
appropriately) we may assume that $\mu_i = 0$ (and so $\mu = 0$); dividing
the $X_i$ (and $\sigma_i$) through by $K$ we may assume that $K = 1$. By
symmetry it then suffices to establish the upper tail estimate

    $\mathbf{P}(S_n \geq \lambda\sigma) \leq C \max(\exp(-c\lambda^2), \exp(-c\lambda\sigma))$

(with slightly different constants $C, c$).

To do this, we shall first compute the exponential moments

    $\mathbf{E}\exp(tS_n)$

where $0 \leq t \leq 1$ is a real parameter to be optimised later.
Expanding out the exponential and using the independence hypothesis, we
conclude that

    $\mathbf{E}\exp(tS_n) = \prod_{i=1}^n \mathbf{E}\exp(tX_i)$.

To compute $\mathbf{E}\exp(tX_i)$, we use the hypothesis that $|X_i| \leq 1$
and (2.9) to obtain

    $\mathbf{E}\exp(tX_i) \leq \exp(O(t^2 \sigma_i^2))$.

Thus we have

    $\mathbf{E}\exp(tS_n) = \exp(O(t^2 \sigma^2))$

and thus by Markov's inequality (1.13)

    $\mathbf{P}(S_n \geq \lambda\sigma) \leq \exp(O(t^2 \sigma^2) - t\lambda\sigma)$.

If we optimise this in $t$, subject to the constraint $0 \leq t \leq 1$,
we obtain the claim. □
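
For the reader who wishes to see the final optimisation spelled out, here is one way to do it. We write $C_0 > 0$ for the implied constant in the $O(t^2\sigma^2)$ term; this symbol is our own notation for this sketch and does not appear in the argument above.

```latex
\begin{align*}
f(t) &:= C_0 t^2 \sigma^2 - t \lambda \sigma, \qquad 0 \le t \le 1, \\
\text{if } \lambda \le 2 C_0 \sigma:\quad
  & t = \frac{\lambda}{2 C_0 \sigma} \in [0,1], \qquad
    f(t) = -\frac{\lambda^2}{4 C_0}, \\
\text{if } \lambda > 2 C_0 \sigma:\quad
  & t = 1, \qquad
    f(1) = C_0 \sigma^2 - \lambda \sigma
         < \tfrac{1}{2}\lambda\sigma - \lambda\sigma
         = -\tfrac{1}{2}\lambda\sigma .
\end{align*}
```

In the first case the unconstrained minimiser is admissible and gives the Gaussian-type term $\exp(-\lambda^2/4C_0)$; in the second case the constraint $t \leq 1$ is active and one obtains the exponential-type term $\exp(-\lambda\sigma/2)$ (recall that $K$ has been normalised to 1). Taking the larger of the two gives the stated maximum.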

Informally, the Chernoff inequality asserts that $S_n$ is sharply
concentrated in the range $n\mu + O(\sigma\sqrt{n})$. The bounds here are
fairly sharp, at least when $\lambda$ is not too large:

Exercise 2.1.2. Let $0 < p < 1/2$ be fixed independently of $n$, and let
$X_1, \dots, X_n$ be iid copies of a Bernoulli random variable that equals
1 with probability $p$, thus $\mu_i = p$ and $\sigma_i^2 = p(1-p)$, and so
$\mu = np$ and $\sigma^2 = np(1-p)$. Using Stirling's formula (Section 1.2),
show that

    $\mathbf{P}(|S_n - \mu| \geq \lambda\sigma) \geq c \exp(-C\lambda^2)$

for some absolute constants $C, c > 0$ and all $\lambda \leq c\sigma$. What
happens when $\lambda$ is much larger than $\sigma$?

Exercise 2.1.3. Show that the term $\exp(-c\lambda\sigma/K)$ in (2.11) can
be replaced with $(\lambda K/\sigma)^{-c\lambda\sigma/K}$ (which is superior
when $\lambda K \gg \sigma$). (Hint: Allow $t$ to exceed 1.) Compare this
with the results of Exercise 2.1.2.

Exercise 2.1.4 (Hoeffding's inequality). Let $X_1, \dots, X_n$ be
independent real variables, with $X_i$ taking values in an interval
$[a_i, b_i]$, and let $S_n := X_1 + \dots + X_n$. Show that one has

    $\mathbf{P}(|S_n - \mathbf{E}S_n| \geq \lambda\sigma) \leq C \exp(-c\lambda^2)$

for some absolute constants $C, c > 0$, where
$\sigma^2 := \sum_{i=1}^n |b_i - a_i|^2$.
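
Hoeffding's inequality is classically known to hold with the explicit constants $C = 2$, $c = 2$ (in the normalisation of Exercise 2.1.4). The following sketch compares this explicit form with the exact tail of a sum of fair coin flips, for which $a_i = 0$, $b_i = 1$ and $\sigma^2 = n$. The explicit constants and the helper names are additions for illustration, not part of the exercise.

```python
from fractions import Fraction
from math import comb, exp, sqrt

def exact_two_sided_tail(n, s):
    """P(|S_n - n/2| >= s) for S_n a sum of n fair {0,1} coin flips (exact)."""
    hits = sum(comb(n, j) for j in range(n + 1) if abs(2 * j - n) >= 2 * s)
    return Fraction(hits, 2 ** n)

def hoeffding_bound(lam):
    """2 exp(-2 lam^2): the classical explicit form of the bound in Exercise 2.1.4."""
    return 2 * exp(-2 * lam * lam)
```

For example, with $n = 100$ and $\lambda = 2$ one compares `exact_two_sided_tail(100, 2 * sqrt(100))` against `hoeffding_bound(2)`; the exact tail is smaller, as it must be.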

Remark 2.1.4. As we can see, the exponential moment method is
very slick compared to the power moment method. Unfortunately,
due to its reliance on the identity $e^{X+Y} = e^X e^Y$, this method relies
very strongly on commutativity of the underlying variables, and as
such will not be as useful when dealing with noncommutative random
variables, and in particular with random matrices$^4$. Nevertheless, we
will still be able to apply the Chernoff bound to good effect to various
components of random matrices, such as rows or columns of such
matrices.

The full assumption of joint independence is not completely necessary
for Chernoff-type bounds to be present. It suffices to have a
martingale difference sequence, in which each $X_i$ can depend on the
preceding variables $X_1, \dots, X_{i-1}$, but which always has mean zero
even when the preceding variables are conditioned out. More precisely
we have Azuma's inequality:

$^4$ See however Section 3.2 for a partial resolution of this issue.

Theorem 2.1.5 (Azuma's inequality). Let $X_1, \dots, X_n$ be a sequence
of scalar random variables with $|X_i| \leq 1$ almost surely. Assume also
that we have$^5$ the martingale difference property

(2.12)  $\mathbf{E}(X_i | X_1, \dots, X_{i-1}) = 0$

almost surely for all $i = 1, \dots, n$. Then for any $\lambda > 0$, the sum
$S_n := X_1 + \dots + X_n$ obeys the large deviation inequality

(2.13)  $\mathbf{P}(|S_n| \geq \lambda\sqrt{n}) \leq C \exp(-c\lambda^2)$

for some absolute constants $C, c > 0$.

A typical example of $S_n$ here is a dependent random walk, in
which the magnitude and probabilities of the $i$th step are allowed to
depend on the outcome of the preceding $i - 1$ steps, but where the
mean of each step is always fixed to be zero.

Proof. Again, we can reduce to the case when the $X_i$ are real, and
it suffices to establish the upper tail estimate

    $\mathbf{P}(S_n \geq \lambda\sqrt{n}) \leq C \exp(-c\lambda^2)$.

Note that $|S_n| \leq n$ almost surely, so we may assume without loss of
generality that $\lambda \leq \sqrt{n}$.

Once again, we consider the exponential moment $\mathbf{E}\exp(tS_n)$ for
some parameter $t > 0$. We write $S_n = S_{n-1} + X_n$, so that

    $\mathbf{E}\exp(tS_n) = \mathbf{E}\exp(tS_{n-1}) \exp(tX_n)$.

We do not have independence between $S_{n-1}$ and $X_n$, so cannot split
the expectation as in the proof of Chernoff's inequality. Nevertheless
we can use conditional expectation as a substitute. We can rewrite
the above expression as

    $\mathbf{E}\,\mathbf{E}(\exp(tS_{n-1}) \exp(tX_n) | X_1, \dots, X_{n-1})$.

The quantity $S_{n-1}$ is deterministic once we condition on
$X_1, \dots, X_{n-1}$, and so we can pull it out of the conditional
expectation:

    $\mathbf{E}\exp(tS_{n-1})\,\mathbf{E}(\exp(tX_n) | X_1, \dots, X_{n-1})$.

Applying (2.10) to the conditional expectation, we have

    $\mathbf{E}(\exp(tX_n) | X_1, \dots, X_{n-1}) \leq \exp(O(t^2))$

and

    $\mathbf{E}\exp(tS_n) \leq \exp(O(t^2))\,\mathbf{E}\exp(tS_{n-1})$.

Iterating this argument gives

    $\mathbf{E}\exp(tS_n) \leq \exp(O(nt^2))$

and thus by Markov's inequality (1.13)

    $\mathbf{P}(S_n \geq \lambda\sqrt{n}) \leq \exp(O(nt^2) - t\lambda\sqrt{n})$.

Optimising in $t$ gives the claim. □

$^5$ Here we assume the existence of a suitable disintegration in order to
define the conditional expectation, though in fact it is possible to
state and prove Azuma's inequality without this disintegration.
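
To illustrate Azuma's inequality on a genuinely dependent walk, one can compute the law of $S_n$ exactly by dynamic programming. The particular walk below (step size 1 when the walk is at a non-negative position, step size 1/2 otherwise) is a hypothetical example chosen for this sketch, and the comparison uses the classical explicit constants $C = 2$, $c = 1/2$ for (2.13), which are not stated in the text.

```python
from fractions import Fraction
from math import exp, sqrt

def dependent_walk_law(n):
    """Exact law of S_n for a walk whose step size depends on the past.

    Step i is +-c with equal probability, where c = 1 if S_{i-1} >= 0 and
    c = 1/2 otherwise, so |X_i| <= 1 and E(X_i | X_1, ..., X_{i-1}) = 0.
    """
    law = {Fraction(0): Fraction(1)}
    for _ in range(n):
        new_law = {}
        for position, prob in law.items():
            step = Fraction(1) if position >= 0 else Fraction(1, 2)
            for sign in (-1, 1):
                target = position + sign * step
                new_law[target] = new_law.get(target, Fraction(0)) + prob / 2
        law = new_law
    return law

def tail(law, threshold):
    """P(|S_n| >= threshold) under the given exact law."""
    return sum(p for s, p in law.items() if abs(s) >= threshold)

def azuma_bound(lam):
    """2 exp(-lam^2 / 2): the classical explicit constants for (2.13)."""
    return 2 * exp(-lam * lam / 2)
```

The steps here are not independent (the law of $X_i$ depends on the sign of $S_{i-1}$), so the Chernoff inequality does not apply directly, but the martingale difference property (2.12) holds by construction.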

Exercise 2.1.5. Suppose we replace the hypothesis $|X_i| \leq 1$ in
Azuma's inequality with the more general hypothesis $|X_i| \leq c_i$ for
some scalars $c_i > 0$. Show that we still have (2.13), but with
$\sqrt{n}$ replaced by $(\sum_{i=1}^n c_i^2)^{1/2}$.

Remark 2.1.6. The exponential moment method is also used frequently
in harmonic analysis to deal with lacunary exponential sums,
or sums involving Rademacher functions (which are the analogue of
lacunary exponential sums for characteristic 2). Examples here include
Khintchine's inequality (and the closely related Kahane's inequality);
see e.g. [Wo2003], [Ka1985]. The exponential moment method also
combines very well with log-Sobolev inequalities, as we shall see below
(basically because the logarithm inverts the exponential), as well as
with the closely related hypercontractivity inequalities.

2.1.2. The truncation method. To summarise the discussion so
far, we have identified a number of large deviation inequalities to
control a sum $S_n = X_1 + \dots + X_n$:

(i) The zeroth moment method bound (2.1), which requires no
moment assumptions on the $X_i$ but is only useful when $X_i$
is usually zero, and has no decay in $\lambda$.

(ii) The first moment method bound (2.2), which only requires
absolute integrability on the $X_i$, but has only a linear decay
in $\lambda$.

(iii) The second moment method bound (2.5), which requires
second moment and pairwise independence bounds on $X_i$,
and gives a quadratic decay in $\lambda$.

(iv) Higher moment bounds (2.7), which require boundedness
and $k$-wise independence, and give a $k$th power decay in $\lambda$
(or quadratic-exponential decay, after optimising in $k$).

(v) Exponential moment bounds such as (2.11) or (2.13), which
require boundedness and joint independence (or martingale
behaviour), and give quadratic-exponential decay in $\lambda$.

We thus see that the bounds with the strongest decay in $\lambda$ require
strong boundedness and independence hypotheses. However, one can
often partially extend these strong results from the case of bounded
random variables to that of unbounded random variables (provided
one still has sufficient control on the decay of these variables) by
a simple but fundamental trick, known as the truncation method.
The basic idea here is to take each random variable $X_i$ and split it
as $X_i = X_{i,\leq N} + X_{i,>N}$, where $N$ is a truncation parameter to be
optimised later (possibly in a manner depending on $n$),

    $X_{i,\leq N} := X_i \mathbf{I}(|X_i| \leq N)$

is the restriction of $X_i$ to the event that $|X_i| \leq N$ (thus
$X_{i,\leq N}$ vanishes when $X_i$ is too large), and

    $X_{i,>N} := X_i \mathbf{I}(|X_i| > N)$

is the restriction to the complementary event. One can similarly split
$S_n = S_{n,\leq N} + S_{n,>N}$ where

    $S_{n,\leq N} = X_{1,\leq N} + \dots + X_{n,\leq N}$

and

    $S_{n,>N} = X_{1,>N} + \dots + X_{n,>N}$.

The idea is then to estimate the tail of $S_{n,\leq N}$ and $S_{n,>N}$ by
two different means. With $S_{n,\leq N}$, the point is that the variables
$X_{i,\leq N}$ have been made bounded by fiat, and so the more powerful
large deviation inequalities can now be put into play. With $S_{n,>N}$,
in contrast, the underlying variables $X_{i,>N}$ are certainly not bounded,
but they tend to have small zeroth and first moments, and so the
bounds based on those moment methods tend to be powerful here$^6$.
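
In computational terms the truncation split is very simple. The following minimal sketch implements the decomposition $X_i = X_{i,\leq N} + X_{i,>N}$ and the induced split of the sum on a list of sample values; the function names are hypothetical and chosen only for this illustration.

```python
def truncate(x, N):
    """Split x as x_low + x_high with x_low = x I(|x| <= N), x_high = x I(|x| > N)."""
    if abs(x) <= N:
        return x, 0
    return 0, x

def split_sum(xs, N):
    """Return (S_{n,<=N}, S_{n,>N}) for the sample xs = [X_1, ..., X_n]."""
    parts = [truncate(x, N) for x in xs]
    return sum(low for low, _ in parts), sum(high for _, high in parts)
```

For example, `split_sum([0.5, -3, 2, 10, -0.25], 2)` returns `(2.25, 7)`: the three values of magnitude at most 2 go into the bounded part, and the two large values go into the tail part. Since `abs` is used, the same code works for complex values.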

Let us begin with a simple application of this method. 

Theorem 2.1.7 (Weak law of large numbers). Let $X_1, X_2, \dots$ be iid
scalar random variables with $X_i \equiv X$ for all $i$, where $X$ is
absolutely integrable. Then $S_n/n$ converges in probability to
$\mathbf{E}X$.

Proof. By subtracting $\mathbf{E}X$ from $X$ we may assume without loss of
generality that $X$ has mean zero. Our task is then to show that
$\mathbf{P}(|S_n| \geq \varepsilon n) = o(1)$ for all fixed $\varepsilon > 0$.

If $X$ has finite variance, then the claim follows from (2.5). If
$X$ has infinite variance, we cannot apply (2.5) directly, but we may
perform the truncation method as follows. Let $N$ be a large parameter
to be chosen later, and split $X_i = X_{i,\leq N} + X_{i,>N}$,
$S_n = S_{n,\leq N} + S_{n,>N}$ (and $X = X_{\leq N} + X_{>N}$) as discussed
above. The variable $X_{\leq N}$ is bounded and thus has bounded variance;
also, from the dominated convergence theorem we see that
$|\mathbf{E}X_{\leq N}| \leq \varepsilon/4$ (say) if $N$ is large enough. From
(2.5) we conclude that

    $\mathbf{P}(|S_{n,\leq N}| \geq \varepsilon n/2) = o(1)$

(where the rate of decay here depends on $N$ and $\varepsilon$). Meanwhile,
to deal with the tail $X_{>N}$ we use (2.2) to conclude that

    $\mathbf{P}(|S_{n,>N}| \geq \varepsilon n/2) \leq \frac{2}{\varepsilon} \mathbf{E}|X_{>N}|$.

But by the dominated convergence theorem (or monotone convergence
theorem), we may make $\mathbf{E}|X_{>N}|$ as small as we please (say, smaller
than $\delta > 0$) by taking $N$ large enough. Summing, we conclude that

    $\mathbf{P}(|S_n| \geq \varepsilon n) \leq \frac{2}{\varepsilon}\delta + o(1)$;

since $\delta$ is arbitrary, we obtain the claim. □



A more sophisticated variant of this argument$^7$ gives

Theorem 2.1.8 (Strong law of large numbers). Let $X_1, X_2, \dots$ be iid
scalar random variables with $X_i \equiv X$ for all $i$, where $X$ is
absolutely integrable. Then $S_n/n$ converges almost surely to
$\mathbf{E}X$.

$^6$ Readers who are familiar with harmonic analysis may recognise this
type of "divide and conquer argument" as an interpolation argument; see
[Ta2010, §1.11].

$^7$ See [Ta2009, §1.4] for a more detailed discussion of this argument.

Proof. We may assume without loss of generality that $X$ is real,
since the complex case then follows by splitting into real and imaginary
parts. By splitting $X$ into positive and negative parts, we may
furthermore assume that $X$ is non-negative$^8$. In particular, $S_n$ is
now non-decreasing in $n$.

Next, we apply a sparsification trick. Let $0 < \varepsilon < 1$. Suppose
that we knew that, almost surely, $S_{n_m}/n_m$ converged to $\mathbf{E}X$
for $n = n_m$ of the form $n_m := \lfloor (1+\varepsilon)^m \rfloor$ for some
integer $m$. Then, for all other values of $n$, we see that
asymptotically, $S_n/n$ can only fluctuate by a multiplicative factor of
$1 + O(\varepsilon)$, thanks to the monotone nature of $S_n$. Because of
this and countable additivity, we see that it suffices to show that
$S_{n_m}/n_m$ converges to $\mathbf{E}X$. Actually, it will be enough to
show that almost surely, one has $|S_{n_m}/n_m - \mathbf{E}X| \leq \varepsilon$
for all but finitely many $m$.

Fix $\varepsilon$. As before, we split $X = X_{>N_m} + X_{\leq N_m}$ and
$S_{n_m} = S_{n_m,>N_m} + S_{n_m,\leq N_m}$, but with the twist that we now
allow $N = N_m$ to depend on $m$. Then for $N_m$ large enough we have
$|\mathbf{E}X_{\leq N_m} - \mathbf{E}X| \leq \varepsilon/2$ (say), by dominated
convergence. Applying (2.5) as before, we see that

    $\mathbf{P}(|S_{n_m,\leq N_m}/n_m - \mathbf{E}X| > \varepsilon) \leq \frac{C_\varepsilon}{n_m} \mathbf{E}|X_{\leq N_m}|^2$

for some $C_\varepsilon$ depending only on $\varepsilon$ (the exact value is not
important here). To handle the tail, we will not use the first moment
bound (2.2) as done previously, but now turn to the zeroth-moment bound
(2.1) to obtain

    $\mathbf{P}(S_{n_m,>N_m} \neq 0) \leq n_m \mathbf{P}(|X| > N_m)$;

summing, we conclude

    $\mathbf{P}(|S_{n_m}/n_m - \mathbf{E}X| > \varepsilon) \leq \frac{C_\varepsilon}{n_m} \mathbf{E}|X_{\leq N_m}|^2 + n_m \mathbf{P}(|X| > N_m)$.

$^8$ Of course, by doing so, we can no longer normalise $X$ to have mean
zero, but for us the non-negativity will be more convenient than the
zero mean property.

Applying the Borel-Cantelli lemma (Exercise 1.1.1), we see that we
will be done as long as we can choose $N_m$ such that

    $\sum_{m=1}^\infty \frac{1}{n_m} \mathbf{E}|X_{\leq N_m}|^2$

and

    $\sum_{m=1}^\infty n_m \mathbf{P}(|X| > N_m)$

are both finite. But this can be accomplished by setting $N_m := n_m$
and interchanging the sum and expectations (writing $\mathbf{P}(|X| > N_m)$
as $\mathbf{E}\,\mathbf{I}(|X| > N_m)$) and using the lacunary nature of the
$n_m$ (which in particular shows that $\sum_{m: n_m \leq X} n_m = O(X)$ and
$\sum_{m: n_m \geq X} n_m^{-1} = O(X^{-1})$ for any $X > 0$). □

To give another illustration of the truncation method, we extend 
a version of the Chernoff bound to the subgaussian case. 

Proposition 2.1.9. Let $X_1, \dots, X_n \equiv X$ be iid copies of a
subgaussian random variable $X$, thus $X$ obeys a bound of the form

(2.14)  $\mathbf{P}(|X| \geq t) \leq C \exp(-ct^2)$

for all $t > 0$ and some $C, c > 0$. Let $S_n := X_1 + \dots + X_n$. Then for
any sufficiently large $A$ (independent of $n$) we have

    $\mathbf{P}(|S_n - n\mathbf{E}X| \geq An) \leq C_A \exp(-c_A n)$

for some constants $C_A, c_A$ depending on $A, C, c$. Furthermore, $c_A$
grows linearly in $A$ as $A \to \infty$.

Proof. By subtracting the mean from $X$ we may normalise $\mathbf{E}X = 0$.
We perform a dyadic decomposition

    $X_i = X_{i,0} + \sum_{m=1}^\infty X_{i,m}$

where $X_{i,0} := X_i \mathbf{I}(|X_i| \leq 1)$ and
$X_{i,m} := X_i \mathbf{I}(2^{m-1} < |X_i| \leq 2^m)$. We similarly split

    $S_n = S_{n,0} + \sum_{m=1}^\infty S_{n,m}$

where $S_{n,m} = \sum_{i=1}^n X_{i,m}$. Then by the union bound and the
pigeonhole principle we have

    $\mathbf{P}(|S_n| \geq An) \leq \sum_{m=0}^\infty \mathbf{P}\left(|S_{n,m}| \geq \frac{A}{100(m+1)^2} n\right)$

(say). Each $X_{i,m}$ is clearly bounded in magnitude by $2^m$; from the
subgaussian hypothesis one can also verify that the mean and variance
of $X_{i,m}$ are at most $C' \exp(-c' 2^{2m})$ for some $C', c' > 0$. If $A$
is large enough, an application of the Chernoff bound (2.11) (or more
precisely, the refinement in Exercise 2.1.3) then gives (after some
computation)

    $\mathbf{P}(|S_{n,m}| \geq 2^{-m-1} An) \leq C' 2^{-m} \exp(-c' An)$

(say) for some $C', c' > 0$, and the claim follows. □

Exercise 2.1.6. Show that the hypothesis that $A$ is sufficiently large
can be replaced by the hypothesis that $A > 0$ is independent of $n$.
Hint: There are several approaches available. One can adapt the
above proof; one can modify the proof of the Chernoff inequality
directly; or one can figure out a way to deduce the small $A$ case from
the large $A$ case.

Exercise 2.1.7. Show that the subgaussian hypothesis can be generalised
to a sub-exponential tail hypothesis

    $\mathbf{P}(|X| \geq t) \leq C \exp(-ct^p)$

provided that $p > 1$. Show that the result also extends to the
case $0 < p \leq 1$, except with the exponent $\exp(-c_A n)$ replaced by
$\exp(-c_A n^{p-\varepsilon})$ for some $\varepsilon > 0$. (I do not know if the
$\varepsilon$ loss can be removed, but it is easy to see that one cannot
hope to do much better than this, just by considering the probability
that $X_1$ (say) is already as large as $An$.)

2.1.3. Lipschitz combinations. In the preceding discussion, we
had only considered the linear combination $X_1 + \dots + X_n$ of
independent variables $X_1, \dots, X_n$. Now we consider more general
combinations $F(X)$, where we write $X := (X_1, \dots, X_n)$ for short. Of
course, to get any non-trivial results we must make some regularity
hypotheses on $F$. It turns out that a particularly useful class of
regularity hypothesis here is a Lipschitz hypothesis - that small
variations in $X$ lead to small variations in $F(X)$. A simple example of
this is McDiarmid's inequality:

Theorem 2.1.10 (McDiarmid's inequality). Let $X_1, \dots, X_n$ be
independent random variables taking values in ranges $R_1, \dots, R_n$, and
let $F : R_1 \times \dots \times R_n \to \mathbf{C}$ be a function with the
property that if one freezes all but the $i$th coordinate of
$F(x_1, \dots, x_n)$ for some $1 \leq i \leq n$, then $F$ only fluctuates by
at most $c_i > 0$, thus

    $|F(x_1, \dots, x_{i-1}, x_i, x_{i+1}, \dots, x_n) - F(x_1, \dots, x_{i-1}, x_i', x_{i+1}, \dots, x_n)| \leq c_i$

for all $x_j \in R_j$, $x_i' \in R_i$ for $1 \leq j \leq n$. Then for any
$\lambda > 0$, one has

    $\mathbf{P}(|F(X) - \mathbf{E}F(X)| \geq \lambda\sigma) \leq C \exp(-c\lambda^2)$

for some absolute constants $C, c > 0$, where $\sigma^2 := \sum_{i=1}^n c_i^2$.

Proof. We may assume that $F$ is real. By symmetry, it suffices to
show the one-sided estimate

(2.15)  $\mathbf{P}(F(X) - \mathbf{E}F(X) \geq \lambda\sigma) \leq C \exp(-c\lambda^2)$.

To compute this quantity, we again use the exponential moment
method. Let $t > 0$ be a parameter to be chosen later, and consider
the exponential moment

(2.16)  $\mathbf{E}\exp(tF(X))$.

To compute this, let us condition $X_1, \dots, X_{n-1}$ to be fixed, and look
at the conditional expectation

    $\mathbf{E}(\exp(tF(X)) | X_1, \dots, X_{n-1})$.

We can simplify this as

    $\mathbf{E}(\exp(tY) | X_1, \dots, X_{n-1}) \exp(t\mathbf{E}(F(X) | X_1, \dots, X_{n-1}))$

where

    $Y := F(X) - \mathbf{E}(F(X) | X_1, \dots, X_{n-1})$.

For $X_1, \dots, X_{n-1}$ fixed, $tY$ only fluctuates by at most $tc_n$ and
has mean zero. Applying (2.10), we conclude that

    $\mathbf{E}(\exp(tY) | X_1, \dots, X_{n-1}) \leq \exp(O(t^2 c_n^2))$.

Integrating out the conditioning, we see that we have upper bounded
(2.16) by

    $\exp(O(t^2 c_n^2))\,\mathbf{E}\exp(t\,\mathbf{E}(F(X) | X_1, \dots, X_{n-1}))$.

We observe that $\mathbf{E}(F(X) | X_1, \dots, X_{n-1})$ is a function
$F_{n-1}(X_1, \dots, X_{n-1})$ of $X_1, \dots, X_{n-1}$, where $F_{n-1}$ obeys
the same hypotheses as $F$ (but for $n - 1$ instead of $n$). We can then
iterate the above computation $n$ times and eventually upper bound
(2.16) by

    $\exp\left(\sum_{i=1}^n O(t^2 c_i^2)\right) \exp(t\mathbf{E}F(X))$,

which we rearrange as

    $\mathbf{E}\exp(t(F(X) - \mathbf{E}F(X))) \leq \exp(O(t^2 \sigma^2))$,

and thus by Markov's inequality (1.13)

    $\mathbf{P}(F(X) - \mathbf{E}F(X) \geq \lambda\sigma) \leq \exp(O(t^2 \sigma^2) - t\lambda\sigma)$.

Optimising in $t$ then gives the claim. □
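
To see the hypotheses of McDiarmid's inequality in a concrete nonlinear case, consider the number of distinct values among $n$ independent uniform draws from a small alphabet; changing one draw changes this count by at most 1, so one may take $c_i = 1$. The sketch below verifies this by brute force and compares the exact tail with the classical explicit form $2\exp(-2s^2/\sum_i c_i^2)$ of the bound. The example function, the helper names, and the explicit constants are all additions for illustration and are not taken from the text.

```python
from fractions import Fraction
from itertools import product
from math import exp

def distinct_count(xs):
    """F(x_1, ..., x_n) = number of distinct values among the coordinates."""
    return len(set(xs))

def max_coordinate_fluctuation(F, values, n):
    """Largest change in F when a single coordinate is altered (brute force)."""
    worst = 0
    for xs in product(values, repeat=n):
        for i in range(n):
            for v in values:
                ys = xs[:i] + (v,) + xs[i + 1:]
                worst = max(worst, abs(F(xs) - F(ys)))
    return worst

def exact_deviation_tail(F, values, n, s):
    """P(|F(X) - E F(X)| >= s) for X uniform on values^n (exact)."""
    outcomes = [F(xs) for xs in product(values, repeat=n)]
    mean = Fraction(sum(outcomes), len(outcomes))
    hits = sum(1 for f in outcomes if abs(f - mean) >= s)
    return Fraction(hits, len(outcomes))

def mcdiarmid_bound(s, cs):
    """2 exp(-2 s^2 / sum c_i^2): the classical explicit form of the theorem."""
    return 2 * exp(-2 * s * s / sum(c * c for c in cs))
```

In the notation of Theorem 2.1.10 the deviation $s$ corresponds to $\lambda\sigma$. With five draws from a three-letter alphabet the enumeration is over only 243 outcomes, so the bound is far from tight; the point of the sketch is only to make the bounded-difference hypothesis tangible.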

Exercise 2.1.8. Show that McDiarmid's inequality implies Hoeffding's
inequality (Exercise 2.1.4).

Remark 2.1.11. One can view McDiarmid's inequality as a tensorisation
of Hoeffding's lemma, as it leverages the latter lemma for a
single random variable to establish an analogous result for $n$ random
variables. It is possible to apply this tensorisation trick to random
variables taking values in more sophisticated metric spaces than an
interval $[a, b]$, leading to a class of concentration of measure
inequalities known as transportation cost-information inequalities,
which will not be discussed here.

The most powerful concentration of measure results, though, do
not just exploit Lipschitz type behaviour in each individual variable,
but joint Lipschitz behaviour. Let us first give a classical instance of
this, in the special case when the $X_1, \dots, X_n$ are Gaussian variables.
A key property of Gaussian variables is that any linear combination
of independent Gaussians is again a Gaussian:

Exercise 2.1.9. Let $X_1, \dots, X_n$ be independent real Gaussian
variables with $X_i \equiv N(\mu_i, \sigma_i^2)_{\mathbf{R}}$, and let
$c_1, \dots, c_n$ be real constants. Show that $c_1 X_1 + \dots + c_n X_n$ is
a real Gaussian with mean $\sum_{i=1}^n c_i \mu_i$ and variance
$\sum_{i=1}^n |c_i|^2 \sigma_i^2$.

Show that the same claims also hold with complex Gaussians and
complex constants $c_i$.

Exercise 2.1.10 (Rotation invariance). Let $X = (X_1, \dots, X_n)$ be an
$\mathbf{R}^n$-valued random variable, where
$X_1, \dots, X_n \equiv N(0,1)_{\mathbf{R}}$ are iid real Gaussians. Show that
for any orthogonal matrix $U \in O(n)$, $UX \equiv X$.

Show that the same claim holds for complex Gaussians (so $X$ is
now $\mathbf{C}^n$-valued), and with the orthogonal group $O(n)$ replaced by
the unitary group $U(n)$.

Theorem 2.1.12 (Gaussian concentration inequality for Lipschitz
functions). Let $X_1, \dots, X_n \equiv N(0,1)_{\mathbf{R}}$ be iid real Gaussian
variables, and let $F : \mathbf{R}^n \to \mathbf{R}$ be a 1-Lipschitz function
(i.e. $|F(x) - F(y)| \leq |x - y|$ for all $x, y \in \mathbf{R}^n$, where we use
the Euclidean metric on $\mathbf{R}^n$). Then for any $\lambda$ one has

    $\mathbf{P}(|F(X) - \mathbf{E}F(X)| \geq \lambda) \leq C \exp(-c\lambda^2)$

for some absolute constants $C, c > 0$.

Proof. We use the following elegant argument of Maurey and Pisier.
By subtracting a constant from $F$, we may normalise $\mathbf{E}F(X) = 0$.
By symmetry it then suffices to show the upper tail estimate

    $\mathbf{P}(F(X) \geq \lambda) \leq C \exp(-c\lambda^2)$.

By smoothing $F$ slightly we may assume that $F$ is smooth, since the
general case then follows from a limiting argument. In particular, the
Lipschitz bound on $F$ now implies the gradient estimate

(2.17)  $|\nabla F(x)| \leq 1$

for all $x \in \mathbf{R}^n$.

Once again, we use the exponential moment method. It will suffice
to show that

    $\mathbf{E}\exp(tF(X)) \leq \exp(Ct^2)$

for some constant $C > 0$ and all $t > 0$, as the claim follows from
Markov's inequality (1.13) and optimisation in $t$ as in previous
arguments.

To exploit the Lipschitz nature of $F$, we will need to introduce
a second copy of $F(X)$. Let $Y$ be an independent copy of $X$. Since
$\mathbf{E}F(Y) = 0$, we see from Jensen's inequality (Exercise 1.1.8) that

    $\mathbf{E}\exp(-tF(Y)) \geq 1$

and thus (by independence of $X$ and $Y$)

    $\mathbf{E}\exp(tF(X)) \leq \mathbf{E}\exp(t(F(X) - F(Y)))$.

It is tempting to use the fundamental theorem of calculus along a line
segment,

    $F(X) - F(Y) = \int_0^1 \frac{d}{dt} F((1-t)Y + tX)\,dt$,

to estimate $F(X) - F(Y)$, but it turns out for technical reasons to be
better to use a circular arc instead,

    $F(X) - F(Y) = \int_0^{\pi/2} \frac{d}{d\theta} F(Y\cos\theta + X\sin\theta)\,d\theta$.

The reason for this is that $X_\theta := Y\cos\theta + X\sin\theta$ is another
Gaussian random variable equivalent to $X$, as is its derivative
$X'_\theta := -Y\sin\theta + X\cos\theta$ (by Exercise 2.1.9); furthermore,
and crucially, these two random variables are independent (by
Exercise 2.1.10).
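
As a quick consistency check on the independence claim, one can verify directly that the components of $X_\theta$ and $X'_\theta$ are uncorrelated. The computation below is a sketch that uses only the facts that $X$ and $Y$ are independent with iid $N(0,1)_{\mathbf{R}}$ components; the subscripts $i, j$ denoting components and the Kronecker delta $\delta_{ij}$ are notation introduced here.

```latex
\begin{align*}
\mathbf{E}\,(X_\theta)_i (X'_\theta)_j
 &= \mathbf{E}\,(Y_i\cos\theta + X_i\sin\theta)(-Y_j\sin\theta + X_j\cos\theta) \\
 &= -\cos\theta\sin\theta\,\mathbf{E}\,Y_i Y_j
    + \cos^2\theta\,\mathbf{E}\,Y_i X_j
    - \sin^2\theta\,\mathbf{E}\,X_i Y_j
    + \sin\theta\cos\theta\,\mathbf{E}\,X_i X_j \\
 &= -\cos\theta\sin\theta\,\delta_{ij} + 0 - 0 + \sin\theta\cos\theta\,\delta_{ij}
  = 0 .
\end{align*}
```

Since $(X_\theta, X'_\theta)$ is obtained from the Gaussian vector $(X, Y)$ by a rotation of $\mathbf{R}^{2n}$, it is jointly Gaussian, and jointly Gaussian vectors with vanishing cross-covariance are independent; this is the content of the appeal to rotation invariance.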

To exploit this, we first use Jensen's inequality (Exercise 1.1.8)
to bound

    $\exp(t(F(X) - F(Y))) \leq \frac{2}{\pi} \int_0^{\pi/2} \exp\left(\frac{\pi t}{2} \frac{d}{d\theta} F(X_\theta)\right) d\theta$.

Applying the chain rule and taking expectations, we have

    $\mathbf{E}\exp(t(F(X) - F(Y))) \leq \frac{2}{\pi} \int_0^{\pi/2} \mathbf{E}\exp\left(\frac{\pi t}{2} \nabla F(X_\theta) \cdot X'_\theta\right) d\theta$.

Let us condition $X_\theta$ to be fixed, then $X'_\theta \equiv X$; applying
Exercise 2.1.9 and (2.17), we conclude that
$\frac{\pi t}{2} \nabla F(X_\theta) \cdot X'_\theta$ is normally distributed
with standard deviation at most $\frac{\pi t}{2}$. As such we have

    $\mathbf{E}\exp\left(\frac{\pi t}{2} \nabla F(X_\theta) \cdot X'_\theta\right) \leq \exp(Ct^2)$

for some absolute constant $C$; integrating out the conditioning on
$X_\theta$ we obtain the claim. □
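
A simple way to see this theorem in action is to take $F(x) = |x|$, the Euclidean norm, which is 1-Lipschitz. The theorem then predicts that $|X|$ fluctuates by only $O(1)$ around its mean, even though the mean itself is of size about $\sqrt{n}$. The following is a small simulation sketch, with a fixed seed for reproducibility; the function names, sample sizes and seed are arbitrary choices made for this illustration.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def norm_samples(n, trials, seed=0):
    """Draw `trials` samples of F(X) = |X| for X a standard Gaussian in R^n."""
    rng = random.Random(seed)
    return [sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)))
            for _ in range(trials)]

def summary(n, trials=400, seed=0):
    """Empirical mean and standard deviation of the 1-Lipschitz function |X|."""
    samples = norm_samples(n, trials, seed)
    return mean(samples), pstdev(samples)
```

Running `summary(n)` for $n = 25, 100, 400$ one should see the empirical mean grow like $\sqrt{n}$ while the empirical standard deviation stays bounded, uniformly in the dimension.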

Exercise 2.1.11. Show that Theorem 2.1.12 is equivalent to the
inequality

    $\mathbf{P}(X \in A)\,\mathbf{P}(X \notin A_\lambda) \leq C \exp(-c\lambda^2)$

holding for all $\lambda > 0$ and all measurable sets $A$, where
$X = (X_1, \dots, X_n)$ is an $\mathbf{R}^n$-valued random variable with iid
Gaussian components $X_1, \dots, X_n \equiv N(0,1)_{\mathbf{R}}$, and $A_\lambda$
is the $\lambda$-neighbourhood of $A$.

Now we give a powerful concentration inequality of Talagrand, 
which we will rely heavily on later in this text. 

Theorem 2.1.13 (Talagrand concentration inequality). Let $K > 0$,
and let $X_1, \dots, X_n$ be independent complex variables with
$|X_i| \le K$ for all $1 \le i \le n$. Let $F : \mathbf{C}^n \to \mathbf{R}$ be a
1-Lipschitz convex function (where we identify $\mathbf{C}^n$ with
$\mathbf{R}^{2n}$ for the purposes of defining "Lipschitz" and "convex").
Then for any $\lambda$ one has

$$\mathbf{P}(|F(X) - \mathbf{M}F(X)| \ge \lambda K) \le C \exp(-c\lambda^2) \tag{2.18}$$

and

$$\mathbf{P}(|F(X) - \mathbf{E}F(X)| \ge \lambda K) \le C \exp(-c\lambda^2) \tag{2.19}$$

for some absolute constants $C, c > 0$, where $\mathbf{M}F(X)$ is a median of
$F(X)$.

We now prove the theorem, following the remarkable argument
of Talagrand [Ta1995].

By dividing through by $K$ we may normalise $K = 1$. $X$ now
takes values in the convex set $\Omega^n \subset \mathbf{C}^n$, where $\Omega$ is the
unit disk in $\mathbf{C}$. It will suffice to establish the inequality

$$\mathbf{E} \exp(c\, d(X, A)^2) \le \frac{1}{\mathbf{P}(X \in A)} \tag{2.20}$$

for any convex set $A$ in $\Omega^n$ and some absolute constant $c > 0$, where
$d(X, A)$ is the Euclidean distance between $X$ and $A$. Indeed, if one
obtains this estimate, then one has

$$\mathbf{P}(F(X) \le x)\, \mathbf{P}(F(X) \ge y) \le \exp(-c|x - y|^2)$$



for any $y > x$ (as can be seen by applying (2.20) to the convex set
$A := \{z \in \Omega^n : F(z) \le x\}$). Applying this inequality with one of
$x, y$ equal to the median $\mathbf{M}F(X)$ of $F(X)$ yields (2.18), which in
turn implies that

$$\mathbf{E}F(X) = \mathbf{M}F(X) + O(1),$$

which then gives (2.19).

We would like to establish (2.20) by induction on dimension $n$. In
the case when $X_1, \dots, X_n$ are Bernoulli variables, this can be done;
see [Ta2010b, §1.5]. In the general case, it turns out that in order
to close the induction properly, one must strengthen (2.20) by
replacing the Euclidean distance $d(X, A)$ by an essentially larger
quantity, which I will call the combinatorial distance $d_c(X, A)$ from
$X$ to $A$. For each vector $z = (z_1, \dots, z_n) \in \mathbf{C}^n$ and
$\omega = (\omega_1, \dots, \omega_n) \in \{0,1\}^n$, we say that $\omega$ supports $z$
if $z_i$ is non-zero only when $\omega_i$ is non-zero. Define the
combinatorial support $U_A(X)$ of $A$ relative to $X$ to be all the
vectors in $\{0,1\}^n$ that support at least one vector in $A - X$.
Define the combinatorial hull $V_A(X)$ of $A$ relative to $X$ to be the
convex hull of $U_A(X)$, and then define the combinatorial distance
$d_c(X, A)$ to be the distance between $V_A(X)$ and the origin.

Lemma 2.1.14 (Combinatorial distance controls Euclidean distance).
Let $A$ be a convex subset of $\Omega^n$. Then $d(X, A) \le 2 d_c(X, A)$.

Proof. Suppose $d_c(X, A) \le r$. Then there exists a convex
combination $t = (t_1, \dots, t_n)$ of elements
$\omega \in U_A(X) \subset \{0,1\}^n$ which has magnitude at most $r$. For each
such $\omega \in U_A(X)$, we can find a vector $z_\omega \in X - A$ supported by
$\omega$. As $A, X$ both lie in $\Omega^n$, every coefficient of $z_\omega$ has
magnitude at most 2, and is thus bounded in magnitude by twice the
corresponding coefficient of $\omega$. If we then let $z_t$ be the convex
combination of the $z_\omega$ indicated by $t$, then the magnitude of each
coefficient of $z_t$ is bounded by twice the corresponding coefficient
of $t$, and so $|z_t| \le 2r$. On the other hand, as $A$ is convex, $z_t$
lies in $X - A$, and so $d(X, A) \le 2r$. The claim follows. □

Thus to show (2.20) it suffices (after a modification of the
constant $c$) to show that

$$\mathbf{E} \exp(c\, d_c(X, A)^2) \le \frac{1}{\mathbf{P}(X \in A)}. \tag{2.21}$$



We first verify the one-dimensional case. In this case, $d_c(X, A)$
equals 1 when $X \notin A$, and 0 otherwise, and the claim follows from
elementary calculus (for $c$ small enough).

Now suppose that $n > 1$ and the claim has already been proven
for $n - 1$. We write $X = (X', X_n)$, and let
$A_{X_n} := \{z' \in \Omega^{n-1} : (z', X_n) \in A\}$ be a slice of $A$. We also
let $B := \{z' \in \Omega^{n-1} : (z', t) \in A \text{ for some } t \in \Omega\}$. We have
the following basic inequality:

Lemma 2.1.15. For any $0 \le \lambda \le 1$, we have

$$d_c(X, A)^2 \le (1 - \lambda)^2 + \lambda\, d_c(X', A_{X_n})^2 + (1 - \lambda)\, d_c(X', B)^2.$$

Proof. Observe that $U_A(X)$ contains both $U_{A_{X_n}}(X') \times \{0\}$ and
$U_B(X') \times \{1\}$, and so by convexity, $V_A(X)$ contains
$(\lambda t + (1 - \lambda)u, 1 - \lambda)$ whenever $t \in V_{A_{X_n}}(X')$ and
$u \in V_B(X')$. The claim then follows from Pythagoras' theorem and the
Cauchy-Schwarz inequality. □

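Let us now freeze $X_n$ and consider the conditional expectation

$$\mathbf{E}(\exp(c\, d_c(X, A)^2) \mid X_n).$$

Using the above lemma (with some $\lambda$ depending on $X_n$ to be chosen
later), we may bound the left-hand side of (2.21) by

$$e^{c(1-\lambda)^2}\, \mathbf{E}\left( \left(e^{c\, d_c(X', A_{X_n})^2}\right)^{\lambda} \left(e^{c\, d_c(X', B)^2}\right)^{1-\lambda} \,\Big|\, X_n \right);$$

applying Hölder's inequality and the induction hypothesis (2.21), we
can bound this by

$$e^{c(1-\lambda)^2}\, \frac{1}{\mathbf{P}(X' \in A_{X_n} \mid X_n)^{\lambda}\, \mathbf{P}(X' \in B \mid X_n)^{1-\lambda}}$$

which we can rearrange as

$$\frac{1}{\mathbf{P}(X' \in B)}\, e^{c(1-\lambda)^2} r^{-\lambda}$$

where $r := \mathbf{P}(X' \in A_{X_n} \mid X_n)/\mathbf{P}(X' \in B)$ (here we note that
the event $X' \in B$ is independent of $X_n$). Note that $0 \le r \le 1$. We
then apply the elementary inequality

$$\inf_{0 \le \lambda \le 1} e^{c(1-\lambda)^2} r^{-\lambda} \le 2 - r,$$

which can be verified by elementary calculus if $c$ is small enough (in
fact one can take $c = 1/4$). We conclude that

$$\mathbf{E}(\exp(c\, d_c(X, A)^2) \mid X_n) \le \frac{1}{\mathbf{P}(X' \in B)} \left(2 - \frac{\mathbf{P}(X' \in A_{X_n} \mid X_n)}{\mathbf{P}(X' \in B)}\right).$$

Taking expectations in $X_n$ we conclude that

$$\mathbf{E} \exp(c\, d_c(X, A)^2) \le \frac{1}{\mathbf{P}(X' \in B)} \left(2 - \frac{\mathbf{P}(X \in A)}{\mathbf{P}(X' \in B)}\right).$$

Using the inequality $x(2 - x) \le 1$ with
$x := \frac{\mathbf{P}(X \in A)}{\mathbf{P}(X' \in B)}$ we conclude (2.21) as desired.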
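
The elementary inequality used above can be sanity-checked numerically for $c = 1/4$. The sketch below is an illustration only and not a proof: the function names are hypothetical, and it evaluates the left-hand side at the explicit minimiser $\lambda = 1 + \log r/(2c)$ (clipped to $[0,1]$), which is obtained by differentiating the exponent $c(1-\lambda)^2 - \lambda \log r$ in $\lambda$.

```python
import numpy as np

def best_lambda(r, c=0.25):
    """Minimiser of c*(1-lam)**2 - lam*log(r) over 0 <= lam <= 1."""
    return np.clip(1.0 + np.log(r) / (2.0 * c), 0.0, 1.0)

def lhs_at_best_lambda(r, c=0.25):
    """Value of exp(c*(1-lam)**2) * r**(-lam) at the minimising lam."""
    lam = best_lambda(r, c)
    return np.exp(c * (1.0 - lam) ** 2) * r ** (-lam)

# Grid of r in (0, 1]; r = 0 is excluded only to avoid log(0).
r = np.linspace(1e-6, 1.0, 100_001)
gap = (2.0 - r) - lhs_at_best_lambda(r)   # should be >= 0 up to rounding
print(gap.min())
```

The inequality is tight to high order as $r \to 1$, which is why the check allows for floating-point rounding rather than demanding a strictly positive gap.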

The above argument was elementary, but rather "magical" in nature.
Let us now give a somewhat different argument of Ledoux [Le1995],
based on log-Sobolev inequalities, which gives the upper tail bound

$$\mathbf{P}(F(X) - \mathbf{E}F(X) \ge \lambda K) \le C \exp(-c\lambda^2), \tag{2.22}$$

but curiously does not give the lower tail bound⁹.

Once again we can normalise K = 1. By regularising F we may 
assume that F is smooth. The first step is to establish the following 
log-Sobolev inequality: 

Lemma 2.1.16 (Log-Sobolev inequality). Let $F : \mathbf{C}^n \to \mathbf{R}$ be a
smooth convex function. Then

$$\mathbf{E} F(X) e^{F(X)} \le (\mathbf{E} e^{F(X)})(\log \mathbf{E} e^{F(X)}) + C\, \mathbf{E} e^{F(X)} |\nabla F(X)|^2$$

for some absolute constant $C$ (independent of $n$).

Remark 2.1.17. If one sets $f := e^{F/2}$ and normalises
$\mathbf{E} f(X)^2 = 1$, this inequality becomes

$$\mathbf{E} |f(X)|^2 \log |f(X)|^2 \le 4C\, \mathbf{E} |\nabla f(X)|^2$$

which more closely resembles the classical log-Sobolev inequality (see
[Gr1975] or [Fe1969]). The constant $C$ here can in fact be taken to
be 2; see [Le1995].



⁹ The situation is not symmetric, due to the convexity hypothesis on $F$.

Proof. We first establish the 1-dimensional case. If we let $Y$ be
an independent copy of $X$, observe that the left-hand side can be
rewritten as

$$\frac{1}{2} \mathbf{E}\left((F(X) - F(Y))(e^{F(X)} - e^{F(Y)})\right) + (\mathbf{E} F(X))(\mathbf{E} e^{F(X)}).$$

From Jensen's inequality (Exercise 1.1.8),
$\mathbf{E} F(X) \le \log \mathbf{E} e^{F(X)}$, so it will suffice to show that

$$\mathbf{E}\left((F(X) - F(Y))(e^{F(X)} - e^{F(Y)})\right) \le 2C\, \mathbf{E} e^{F(X)} |\nabla F(X)|^2.$$

From convexity of $F$ (and hence of $e^F$) and the bounded nature of
$X, Y$, we have

$$F(X) - F(Y) = O(|\nabla F(X)|)$$

and

$$e^{F(X)} - e^{F(Y)} = O(|\nabla F(X)| e^{F(X)})$$

when $F(X) > F(Y)$, which leads to

$$(F(X) - F(Y))(e^{F(X)} - e^{F(Y)}) = O(e^{F(X)} |\nabla F(X)|^2)$$

in this case. Similarly when $F(X) < F(Y)$ (swapping $X$ and $Y$). The
claim follows.

To show the general case, we induct on $n$ (taking care to ensure
that the constant $C$ does not change in this induction process).
Write $X = (X', X_n)$, where $X' := (X_1, \dots, X_{n-1})$. From the
induction hypothesis, we have

$$\mathbf{E}(F(X) e^{F(X)} \mid X_n) \le f(X_n) e^{f(X_n)} + C\, \mathbf{E}(e^{F(X)} |\nabla' F(X)|^2 \mid X_n)$$

where $\nabla'$ is the $(n-1)$-dimensional gradient and
$f(X_n) := \log \mathbf{E}(e^{F(X)} \mid X_n)$. Taking expectations, we conclude
that

$$\mathbf{E} F(X) e^{F(X)} \le \mathbf{E} f(X_n) e^{f(X_n)} + C\, \mathbf{E} e^{F(X)} |\nabla' F(X)|^2. \tag{2.23}$$

From the convexity of $F$ and Hölder's inequality we see that $f$ is
also convex, and $\mathbf{E} e^{f(X_n)} = \mathbf{E} e^{F(X)}$. By the $n = 1$ case
already established, we have

$$\mathbf{E} f(X_n) e^{f(X_n)} \le (\mathbf{E} e^{F(X)})(\log \mathbf{E} e^{F(X)}) + C\, \mathbf{E} e^{f(X_n)} |f'(X_n)|^2. \tag{2.24}$$

Now, by the chain rule

$$e^{f(X_n)} |f'(X_n)|^2 = e^{-f(X_n)} |\mathbf{E} e^{F(X)} F_{x_n}(X)|^2$$

where $F_{x_n}$ is the derivative of $F$ in the $x_n$ direction (where
$(x_1, \dots, x_n)$ are the usual coordinates for $\mathbf{R}^n$). Applying
Cauchy-Schwarz, we conclude

$$e^{f(X_n)} |f'(X_n)|^2 \le \mathbf{E} e^{F(X)} |F_{x_n}(X)|^2.$$

Inserting this into (2.23), (2.24) we close the induction. □

Now let $F$ be convex and 1-Lipschitz. Applying the above lemma
to $tF$ for any $t > 0$, we conclude that

$$\mathbf{E}\, tF(X) e^{tF(X)} \le (\mathbf{E} e^{tF(X)})(\log \mathbf{E} e^{tF(X)}) + Ct^2\, \mathbf{E} e^{tF(X)};$$

setting $H(t) := \mathbf{E} e^{tF(X)}$, we can rewrite this as a differential
inequality

$$tH'(t) \le H(t) \log H(t) + Ct^2 H(t)$$

which we can rewrite as

$$\frac{d}{dt}\left(\frac{1}{t} \log H(t)\right) \le C.$$

From Taylor expansion we see that

$$\frac{1}{t} \log H(t) \to \mathbf{E} F(X)$$

as $t \to 0$, and thus

$$\frac{1}{t} \log H(t) \le \mathbf{E} F(X) + Ct$$

for any $t > 0$. In other words,

$$\mathbf{E} e^{tF(X)} \le \exp(t\, \mathbf{E} F(X) + Ct^2).$$

By Markov's inequality (1.13), we conclude that

$$\mathbf{P}(F(X) - \mathbf{E} F(X) > \lambda) \le \exp(Ct^2 - t\lambda);$$

optimising in $t$ gives (2.22).
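
For readers who want the two computational steps spelled out, here is a short worked derivation. It is a sketch under the same assumptions as the text (with $K = 1$), and the explicit choice $t = \lambda/(2C)$ is one convenient way to carry out the optimisation rather than a claim about the best constants.

```latex
% Step 1: the differential inequality in the form used above.
\frac{d}{dt}\left(\frac{1}{t}\log H(t)\right)
  = \frac{H'(t)}{tH(t)} - \frac{\log H(t)}{t^2}
  = \frac{tH'(t) - H(t)\log H(t)}{t^2 H(t)}
  \le \frac{Ct^2 H(t)}{t^2 H(t)} = C.

% Step 2: optimising the exponent Ct^2 - t\lambda over t > 0.
\frac{d}{dt}\left(Ct^2 - t\lambda\right) = 2Ct - \lambda = 0
  \quad\Longrightarrow\quad t = \frac{\lambda}{2C},
\qquad
\mathbf{P}(F(X) - \mathbf{E}F(X) > \lambda)
  \le \exp\left(-\frac{\lambda^2}{4C}\right).
```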

Remark 2.1.18. The same argument, starting with Gross's log-Sobolev
inequality for the Gaussian measure, gives the upper tail
component of Theorem 2.1.12, with no convexity hypothesis on $F$.
The situation is now symmetric with respect to reflections
$F \mapsto -F$, and so one obtains the lower tail component as well. The
method of obtaining concentration inequalities from log-Sobolev
inequalities (or related inequalities, such as Poincaré-type
inequalities) by combining the latter with the exponential moment
method is known as Herbst's argument, and can be used to establish a
number of other functional inequalities of interest.

We now close with a simple corollary of the Talagrand concen- 
tration inequality, which will be extremely useful in the sequel. 

Corollary 2.1.19 (Distance between random vector and a subspace).
Let $X_1, \dots, X_n$ be independent complex-valued random variables with
mean zero and variance 1, and bounded almost surely in magnitude by
$K$. Let $V$ be a subspace of $\mathbf{C}^n$ of dimension $d$. Then for any
$\lambda > 0$, one has

$$\mathbf{P}(|d(X, V) - \sqrt{n - d}| \ge \lambda K) \le C \exp(-c\lambda^2)$$

for some absolute constants $C, c > 0$.

Informally, this corollary asserts that the distance between a
random vector $X$ and an arbitrary subspace $V$ is typically equal to
$\sqrt{n - \dim(V)} + O(1)$.
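
A quick Monte Carlo experiment illustrates this informal statement. The sketch below makes several assumptions that are not in the text: real random signs stand in for the bounded complex variables (so $K = 1$), the subspace $V$ is drawn at random via a QR factorisation, and the sizes `n`, `d`, `trials` and the helper name `distance_to_subspace` are arbitrary illustrative choices.

```python
import numpy as np

def distance_to_subspace(x, basis):
    """Euclidean distance from x to the span of the orthonormal columns of basis."""
    projection = basis @ (basis.conj().T @ x)
    return float(np.linalg.norm(x - projection))

rng = np.random.default_rng(1)
n, d, trials = 200, 50, 2000             # illustrative sizes

# Orthonormal basis of a d-dimensional subspace V (chosen at random here).
basis, _ = np.linalg.qr(rng.standard_normal((n, d)))

# Random sign vectors: mean zero, variance 1, bounded in magnitude by K = 1.
samples = rng.choice([-1.0, 1.0], size=(trials, n))
distances = np.array([distance_to_subspace(x, basis) for x in samples])

# Sample mean, the prediction sqrt(n - d), and the sample spread.
print(distances.mean(), np.sqrt(n - d), distances.std())
```

The sample mean should sit close to $\sqrt{n-d}$ while the spread stays of order 1, independently of $n$.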

Proof. The function $z \mapsto d(z, V)$ is convex and 1-Lipschitz. From
Theorem 2.1.13, one has

$$\mathbf{P}(|d(X, V) - \mathbf{M} d(X, V)| \ge \lambda K) \le C \exp(-c\lambda^2).$$

To finish the argument, it then suffices to show that

$$\mathbf{M} d(X, V) = \sqrt{n - d} + O(K).$$

We begin with a second moment calculation. Observe that

$$d(X, V)^2 = \|\pi(X)\|^2 = \sum_{1 \le i,j \le n} p_{ij} X_i \overline{X_j},$$

where $\pi$ is the orthogonal projection matrix to the complement
$V^\perp$ of $V$, and $p_{ij}$ are the components of $\pi$. Taking
expectations, we obtain

$$\mathbf{E}\, d(X, V)^2 = \sum_{i=1}^n p_{ii} = \operatorname{tr}(\pi) = n - d \tag{2.25}$$

where the latter follows by representing $\pi$ in terms of an
orthonormal basis of $V^\perp$. This is close to what we need, but to
finish the task we need to obtain some concentration of $d(X, V)^2$
around its mean. For this, we write

$$d(X, V)^2 - \mathbf{E}\, d(X, V)^2 = \sum_{1 \le i,j \le n} p_{ij} (X_i \overline{X_j} - \delta_{ij})$$

where $\delta_{ij}$ is the Kronecker delta. The summands here are pairwise
uncorrelated for $1 \le i < j \le n$, and the $i > j$ cases can be combined
with the $i < j$ cases by symmetry. They are also pairwise independent
(and hence pairwise uncorrelated) for $1 \le i = j \le n$. Each summand
also has a variance of $O(K^2)$. We thus have the variance bound

$$\operatorname{Var}(d(X, V)^2) = O\Big(K^2 \sum_{1 \le i < j \le n} |p_{ij}|^2\Big) + O\Big(K^2 \sum_{1 \le i \le n} |p_{ii}|^2\Big) = O(K^2 (n - d)),$$

where the latter bound comes from representing $\pi$ in terms of an
orthonormal basis of $V^\perp$. From this, (2.25), and Chebyshev's
inequality (1.26), we see that the median of $d(X, V)^2$ is equal to
$n - d + O(\sqrt{K^2 (n - d)})$, which implies on taking square roots that
the median of $d(X, V)$ is $\sqrt{n - d} + O(K)$, as desired. □

2.2. The central limit theorem 

Consider the sum $S_n := X_1 + \dots + X_n$ of iid real random variables
$X_1, \dots, X_n \equiv X$ of finite mean $\mu$ and variance $\sigma^2$ for some
$\sigma > 0$. Then the sum $S_n$ has mean $n\mu$ and variance $n\sigma^2$, and so
(by Chebyshev's inequality (1.26)) we expect $S_n$ to usually have size
$n\mu + O(\sqrt{n}\sigma)$. To put it another way, if we consider the
normalised sum

$$Z_n := \frac{S_n - n\mu}{\sqrt{n}\sigma} \tag{2.26}$$

then $Z_n$ has been normalised to have mean zero and variance 1, and
is thus usually of size $O(1)$.

In Section 2.1, we were able to establish various tail bounds on
$Z_n$. For instance, from Chebyshev's inequality (1.26) one has

$$\mathbf{P}(|Z_n| \ge \lambda) \le \lambda^{-2}, \tag{2.27}$$

and if the original distribution $X$ was bounded or subgaussian, we
had the much stronger Chernoff bound

$$\mathbf{P}(|Z_n| \ge \lambda) \le C \exp(-c\lambda^2) \tag{2.28}$$

for some absolute constants $C, c > 0$; in other words, the $Z_n$ are
uniformly subgaussian.
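
The two tail bounds can be compared against simulation in the simplest bounded case. The sketch below assumes signed Bernoulli steps (so $\mu = 0$, $\sigma = 1$) and, for the subgaussian column, uses the constants $C = 2$, $c = 1/2$ from Hoeffding's inequality for this particular distribution; those constants, the sample sizes and the helper name `empirical_tail` are assumptions made for illustration, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 400, 50_000                  # illustrative sizes

# S_n for signed Bernoulli steps equals 2 * Binomial(n, 1/2) - n.
s_n = 2.0 * rng.binomial(n, 0.5, size=trials) - n
z_n = s_n / np.sqrt(n)                   # normalised sum (mu = 0, sigma = 1)

def empirical_tail(z, lam):
    """Fraction of samples with |z| >= lam."""
    return float(np.mean(np.abs(z) >= lam))

for lam in (1.0, 2.0, 3.0):
    chebyshev = lam ** -2                      # bound (2.27)
    hoeffding = 2 * np.exp(-lam ** 2 / 2)      # a bound of the form (2.28)
    print(lam, empirical_tail(z_n, lam), chebyshev, hoeffding)
```

For small $\lambda$ the Chebyshev bound is the sharper of the two, while for larger $\lambda$ the exponential bound takes over.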

Now we look at the distribution of Z n . The fundamental central 
limit theorem tells us the asymptotic behaviour of this distribution: 

Theorem 2.2.1 (Central limit theorem). Let $X_1, \dots, X_n \equiv X$ be
iid real random variables of finite mean $\mu$ and variance $\sigma^2$ for
some $\sigma > 0$, and let $Z_n$ be the normalised sum (2.26). Then as
$n \to \infty$, $Z_n$ converges in distribution to the standard normal
distribution $N(0,1)_{\mathbf{R}}$.

Exercise 2.2.1. Show that $Z_n$ does not converge in probability or
in the almost sure sense (in the latter case, we view $X_1, X_2, \dots$ as
an infinite sequence of iid random variables). (Hint: the intuition
here is that for two very different values $n_1 \ll n_2$ of $n$, the
quantities $Z_{n_1}$ and $Z_{n_2}$ are almost independent of each other,
since the bulk of the sum $S_{n_2}$ is determined by those $X_n$ with
$n > n_1$. Now make this intuition precise.)

Exercise 2.2.2. Use Stirling's formula (Section 1.2) to verify the 
central limit theorem in the case when X is a Bernoulli distribution, 
taking the values 0 and 1 only. (This is a variant of Exercise 1.2.2
or Exercise 2.1.2. It is easy to see that once one does this, one can 
rescale and handle any other two-valued distribution also.) 

Exercise 2.2.3. Use Exercise 2.1.9 to verify the central limit theorem 
in the case when X is Gaussian. 

Note we are only discussing the case of real iid random variables. 
The case of complex random variables (or more generally, vector- 
valued random variables) is a little bit more complicated, and will be 
discussed later in this section. 

The central limit theorem (and its variants, which we discuss be- 
low) are extremely useful tools in random matrix theory, in particular 
through the control they give on random walks (which arise naturally 
from linear functionals of random matrices). But the central limit 
theorem can also be viewed as a "commutative" analogue of various 
spectral results in random matrix theory (in particular, we shall see in 
later sections that the Wigner semicircle law can be viewed in some 



sense as a "noncommutative" or "free" version of the central limit the- 
orem). Because of this, the techniques used to prove the central limit 
theorem can often be adapted to be useful in random matrix theory. 
Because of this, we shall use this section to dwell on several different 
proofs of the central limit theorem, as this provides a convenient way 
to showcase some of the basic methods that we will encounter again 
(in a more sophisticated form) when dealing with random matrices. 

2.2.1. Reductions. We first record some simple reductions one can 
make regarding the proof of the central limit theorem. Firstly we 
observe scale invariance: if the central limit theorem holds for one 
random variable $X$, then it is easy to see that it also holds for
$aX + b$ for any real $a, b$ with $a \ne 0$. Because of this, one can
normalise to the case when $X$ has mean $\mu = 0$ and variance
$\sigma^2 = 1$, in which case $Z_n$ simplifies to

$$Z_n = \frac{X_1 + \dots + X_n}{\sqrt{n}}. \tag{2.29}$$

The other reduction we can make is truncation: to prove the 
central limit theorem for arbitrary random variables X of finite mean 
and variance, it suffices to verify the theorem for bounded random 
variables. To see this, we first need a basic linearity principle: 

Exercise 2.2.4 (Linearity of convergence). Let $V$ be a
finite-dimensional real or complex vector space, $X_n, Y_n$ be sequences
of $V$-valued random variables (not necessarily independent), and let
$X, Y$ be another pair of $V$-valued random variables. Let $c_n, d_n$ be
scalars converging to $c, d$ respectively.

(i) If $X_n$ converges in distribution to $X$, and $Y_n$ converges in
distribution to $Y$, and at least one of $X, Y$ is deterministic,
show that $c_n X_n + d_n Y_n$ converges in distribution to $cX + dY$.

(ii) If $X_n$ converges in probability to $X$, and $Y_n$ converges in
probability to $Y$, show that $c_n X_n + d_n Y_n$ converges in
probability to $cX + dY$.

(iii) If $X_n$ converges almost surely to $X$, and $Y_n$ converges almost
surely to $Y$, show that $c_n X_n + d_n Y_n$ converges almost surely
to $cX + dY$.

Show that the first part of the exercise can fail if $X, Y$ are not
deterministic.

Now suppose that we have established the central limit theorem
for bounded random variables, and want to extend to the unbounded
case. Let $X$ be an unbounded random variable, which we can
normalise to have mean zero and unit variance. Let $N = N_n > 0$ be
a truncation parameter depending on $n$ which, as usual, we shall
optimise later, and split $X = X_{\le N} + X_{>N}$ in the usual fashion
($X_{\le N} = X\, \mathbf{I}(|X| \le N)$; $X_{>N} = X\, \mathbf{I}(|X| > N)$). Thus we have
$S_n = S_{n,\le N} + S_{n,>N}$ as usual.

Let $\mu_{\le N}, \sigma^2_{\le N}$ be the mean and variance of the bounded
random variable $X_{\le N}$. As we are assuming that the central limit
theorem is already true in the bounded case, we know that if we fix $N$
to be independent of $n$, then

$$Z_{n,\le N} := \frac{S_{n,\le N} - n\mu_{\le N}}{\sqrt{n}\, \sigma_{\le N}}$$

converges in distribution to $N(0,1)_{\mathbf{R}}$. By a diagonalisation
argument, we conclude that there exists a sequence $N_n$ going (slowly)
to infinity with $n$, such that $Z_{n,\le N_n}$ still converges in
distribution to $N(0,1)_{\mathbf{R}}$.

For such a sequence, we see from dominated convergence that
$\sigma_{\le N_n}$ converges to $\sigma = 1$. As a consequence of this and
Exercise 2.2.4, we see that

$$\frac{S_{n,\le N_n} - n\mu_{\le N_n}}{\sqrt{n}}$$

converges in distribution to $N(0,1)_{\mathbf{R}}$.

Meanwhile, from dominated convergence again, $\sigma_{>N_n}$ converges
to 0. From this and (2.27) we see that

$$\frac{S_{n,>N_n} - n\mu_{>N_n}}{\sqrt{n}}$$

converges in distribution to 0. Finally, from linearity of expectation
we have $\mu_{\le N_n} + \mu_{>N_n} = \mu = 0$. Summing (using Exercise 2.2.4),
we obtain the claim.
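
The truncation step can be made concrete with a small simulation. In the sketch below the unbounded variable is a Student $t$ variable with 3 degrees of freedom, rescaled to unit variance; this distribution, the truncation levels and the helper name `truncate` are illustrative assumptions, not choices made in the text. It shows the split $X = X_{\le N} + X_{>N}$ and the decay of the second moment of the tail part as $N$ grows.

```python
import numpy as np

def truncate(x, level):
    """Split x into X_{<=N} = x * 1(|x| <= N) and X_{>N} = x * 1(|x| > N)."""
    small = np.where(np.abs(x) <= level, x, 0.0)
    large = np.where(np.abs(x) > level, x, 0.0)
    return small, large

rng = np.random.default_rng(3)
# Student t with 3 degrees of freedom has variance 3; rescale to unit variance.
x = rng.standard_t(3, size=500_000) / np.sqrt(3.0)

levels = [1.0, 2.0, 4.0, 8.0, 16.0]
tail_second_moments = []
for level in levels:
    small, large = truncate(x, level)
    assert np.allclose(small + large, x)       # the split is exact
    tail_second_moments.append(float(np.mean(large ** 2)))
    print(level, small.std(), tail_second_moments[-1])
```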

Remark 2.2.2. The truncation reduction is not needed for some
proofs of the central limit theorem (notably the Fourier-analytic
proof), but is very convenient for some of the other proofs that we
will give here, and will also be used at several places in later
sections.

By applying the scaling reduction after the truncation reduction, 
we observe that to prove the central limit theorem, it suffices to do so 
for random variables X which are bounded and which have mean zero 
and unit variance. (Why is it important to perform the reductions in 
this order?) 

2.2.2. The Fourier method. Let us now give the standard
Fourier-analytic proof of the central limit theorem. Given any real
random variable $X$, we introduce the characteristic function
$F_X : \mathbf{R} \to \mathbf{C}$, defined by the formula

$$F_X(t) := \mathbf{E} e^{itX}. \tag{2.30}$$

Equivalently, $F_X$ is the Fourier transform of the probability measure
$\mu_X$.

Example 2.2.3. The signed Bernoulli distribution has characteristic
function $F(t) = \cos(t)$.

Exercise 2.2.5. Show that the normal distribution $N(\mu, \sigma^2)_{\mathbf{R}}$ has
characteristic function $F(t) = e^{it\mu} e^{-\sigma^2 t^2/2}$.

More generally, for a random variable $X$ taking values in a real
vector space $\mathbf{R}^d$, we define the characteristic function
$F_X : \mathbf{R}^d \to \mathbf{C}$ by

$$F_X(t) := \mathbf{E} e^{it \cdot X} \tag{2.31}$$

where $\cdot$ denotes the Euclidean inner product on $\mathbf{R}^d$. One can
similarly define the characteristic function on complex vector spaces
$\mathbf{C}^d$ by using the complex inner product

$$(z_1, \dots, z_d) \cdot (w_1, \dots, w_d) := \operatorname{Re}(z_1 \overline{w_1} + \dots + z_d \overline{w_d})$$

(or equivalently, by identifying $\mathbf{C}^d$ with $\mathbf{R}^{2d}$ in the usual
manner).

More generally¹⁰, one can define the characteristic function on
any finite dimensional real or complex vector space $V$, by identifying
$V$ with $\mathbf{R}^d$ or $\mathbf{C}^d$.

The characteristic function is clearly bounded in magnitude by 1, 
and equals 1 at the origin. By the Lebesgue dominated convergence 
theorem, F x is continuous in t. 

Exercise 2.2.6 (Riemann-Lebesgue lemma). Show that if $X$ is an
absolutely continuous random variable taking values in $\mathbf{R}^d$ or
$\mathbf{C}^d$, then $F_X(t) \to 0$ as $t \to \infty$. Show that the claim can fail
when the absolute continuity hypothesis is dropped.

Exercise 2.2.7. Show that the characteristic function $F_X$ of a
random variable $X$ taking values in $\mathbf{R}^d$ or $\mathbf{C}^d$ is in fact
uniformly continuous on its domain.

¹⁰ Strictly speaking, one either has to select an inner product on $V$
to do this, or else make the characteristic function defined on the
dual space $V^*$ instead of on $V$ itself; see for instance [Ta2010,
§1.12]. But we will not need to care about this subtlety in our
applications.

Let $X$ be a real random variable. If we Taylor expand $e^{itX}$
and formally interchange the series and expectation, we arrive at the
heuristic identity

$$F_X(t) = \sum_{k=0}^\infty \frac{(it)^k}{k!} \mathbf{E} X^k \tag{2.32}$$

which thus interprets the characteristic function of a real random
variable $X$ as a kind of generating function for the moments. One
rigorous version of this identity is as follows.

Exercise 2.2.8 (Taylor expansion of characteristic function). Let $X$
be a real random variable with finite $k$-th moment for some $k \ge 1$.
Show that $F_X$ is $k$ times continuously differentiable, and one has
the partial Taylor expansion

$$F_X(t) = \sum_{j=0}^k \frac{(it)^j}{j!} \mathbf{E} X^j + o(|t|^k)$$

where $o(|t|^k)$ is a quantity that goes to zero as $t \to 0$, times
$|t|^k$. In particular, we have

$$\frac{d^j}{dt^j} F_X(0) = i^j\, \mathbf{E} X^j$$

for all $0 \le j \le k$.

Exercise 2.2.9. Establish (2.32) in the case that $X$ is subgaussian,
and show that the series converges locally uniformly in $t$.

Note that the characteristic function depends only on the
distribution of $X$: if $X \equiv Y$, then $F_X = F_Y$. The converse
statement is true also: if $F_X = F_Y$, then $X \equiv Y$. This follows from
a more general (and useful) fact, known as Lévy's continuity theorem.

Theorem 2.2.4 (Lévy continuity theorem, special case). Let $V$ be
a finite-dimensional real or complex vector space, and let $X_n$ be a
sequence of $V$-valued random variables, and let $X$ be an additional
$V$-valued random variable. Then the following statements are
equivalent:

(i) $F_{X_n}$ converges pointwise to $F_X$.

(ii) $X_n$ converges in distribution to $X$.

Proof. Without loss of generality we may take $V = \mathbf{R}^d$.

The implication of (i) from (ii) is immediate from (2.31) and the
definition of convergence in distribution (see Definition 1.1.28), since
the function $x \mapsto e^{it \cdot x}$ is bounded continuous.

Now suppose that (i) holds, and we wish to show that (ii) holds.
By Exercise 1.1.25(iv), it suffices to show that

$$\mathbf{E} \varphi(X_n) \to \mathbf{E} \varphi(X)$$

whenever $\varphi : V \to \mathbf{R}$ is a continuous, compactly supported
function. By approximating $\varphi$ uniformly by Schwartz functions (e.g.
using the Stone-Weierstrass theorem, see [Ta2010]), it suffices to show
this for Schwartz functions $\varphi$. But then we have the Fourier
inversion formula

$$\varphi(X_n) = \int_{\mathbf{R}^d} \hat\varphi(t)\, e^{it \cdot X_n}\, dt$$

where

$$\hat\varphi(t) := \frac{1}{(2\pi)^d} \int_{\mathbf{R}^d} \varphi(x)\, e^{-it \cdot x}\, dx$$

is a Schwartz function, and is in particular absolutely integrable (see
e.g. [Ta2010, §1.12]). From the Fubini-Tonelli theorem, we thus have

$$\mathbf{E} \varphi(X_n) = \int_{\mathbf{R}^d} \hat\varphi(t)\, F_{X_n}(t)\, dt \tag{2.33}$$

and similarly for $X$. The claim now follows from the Lebesgue
dominated convergence theorem. □

Remark 2.2.5. Setting $X_n := Y$ for all $n$, we see in particular the
previous claim that $F_X = F_Y$ if and only if $X \equiv Y$. It is
instructive to use the above proof as a guide to prove this claim
directly.

Exercise 2.2.10 (Lévy's continuity theorem, full version). Let $V$ be
a finite-dimensional real or complex vector space, and let $X_n$ be a
sequence of $V$-valued random variables. Suppose that $F_{X_n}$ converges
pointwise to a limit $F$. Show that the following are equivalent:

(i) $F$ is continuous at 0.

(ii) $X_n$ is a tight sequence.

(iii) $F$ is the characteristic function of a $V$-valued random
variable $X$ (possibly after extending the sample space).

(iv) $X_n$ converges in distribution to some $V$-valued random
variable $X$ (possibly after extending the sample space).

Hint: To get from (ii) to the other conclusions, use Prokhorov's
theorem (see Exercise 1.1.25) and Theorem 2.2.4. To get back to (ii)
from (i), use (2.33) for a suitable Schwartz function $\varphi$. The other
implications are easy once Theorem 2.2.4 is in hand.

Remark 2.2.6. Lévy's continuity theorem is very similar in spirit to
Weyl's criterion in equidistribution theory (see e.g. [KuNi2006]).

Exercise 2.2.11 (Esseen concentration inequality). Let $X$ be a
random variable taking values in $\mathbf{R}^d$. Then for any $r > 0$,
$\varepsilon > 0$, show that

$$\sup_{x \in \mathbf{R}^d} \mathbf{P}(|X - x| \le r) \le C_{d,\varepsilon}\, r^d \int_{t \in \mathbf{R}^d : |t| \le \varepsilon/r} |F_X(t)|\, dt \tag{2.34}$$

for some constant $C_{d,\varepsilon}$ depending only on $d$ and $\varepsilon$. (Hint:
Use (2.33) for a suitable Schwartz function $\varphi$.) The left-hand side
of (2.34) is known as the small ball probability of $X$ at radius $r$.

In Fourier analysis, we learn that the Fourier transform is a
particularly well-suited tool for studying convolutions. The
probability theory analogue of this fact is that characteristic
functions are a particularly well-suited tool for studying sums of
independent random variables. More precisely, we have

Exercise 2.2.12 (Fourier identities). Let $V$ be a finite-dimensional
real or complex vector space, and let $X, Y$ be independent random
variables taking values in $V$. Then

$$F_{X+Y}(t) = F_X(t) F_Y(t) \tag{2.35}$$

for all $t \in V$. Also, for any scalar $c$, one has

$$F_{cX}(t) = F_X(\overline{c} t)$$

and more generally, for any linear transformation $T : V \to V$, one
has

$$F_{TX}(t) = F_X(T^* t).$$

Remark 2.2.7. Note that this identity (2.35), combined with Exer- 
cise 2.2.5 and Remark 2.2.5, gives a quick alternate proof of Exercise 
2.1.9. 
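
The multiplicativity identity (2.35) is easy to verify exactly for finitely supported real random variables. The sketch below is only an illustration: the helper names `char_fn` and `sum_distribution` and the second distribution are hypothetical choices, and the law of $X + Y$ is built atom by atom from the product measure.

```python
import numpy as np
from itertools import product

def char_fn(values, probs, t):
    """Characteristic function E exp(i t X) of a finitely supported real X."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return complex(np.sum(probs * np.exp(1j * t * values)))

def sum_distribution(vx, px, vy, py):
    """Law of X + Y for independent X, Y, listed atom by atom (repeats allowed)."""
    atoms = [(a + b, p * q) for (a, p), (b, q) in product(zip(vx, px), zip(vy, py))]
    values, probs = zip(*atoms)
    return values, probs

vx, px = [-1.0, 1.0], [0.5, 0.5]            # signed Bernoulli
vy, py = [0.0, 1.0, 3.0], [0.2, 0.5, 0.3]   # an arbitrary second law

vs, ps = sum_distribution(vx, px, vy, py)
for t in (0.3, 1.0, 2.5):
    lhs = char_fn(vs, ps, t)                          # F_{X+Y}(t)
    rhs = char_fn(vx, px, t) * char_fn(vy, py, t)     # F_X(t) F_Y(t)
    print(t, abs(lhs - rhs))
```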

In particular, in the normalised setting (2.29), we have the simple
relationship

$$F_{Z_n}(t) = F_X(t/\sqrt{n})^n \tag{2.36}$$

that describes the characteristic function of $Z_n$ in terms of that of
$X$.

We now have enough machinery to give a quick proof of the cen- 
tral limit theorem: 

Proof of Theorem 2.2.1. We may normalise $X$ to have mean zero
and variance 1. By Exercise 2.2.8, we thus have

$$F_X(t) = 1 - t^2/2 + o(|t|^2)$$

for sufficiently small $t$, or equivalently

$$F_X(t) = \exp(-t^2/2 + o(|t|^2))$$

for sufficiently small $t$. Applying (2.36), we conclude that

$$F_{Z_n}(t) \to \exp(-t^2/2)$$

as $n \to \infty$ for any fixed $t$. But by Exercise 2.2.5, $\exp(-t^2/2)$ is
the characteristic function of the normal distribution $N(0,1)_{\mathbf{R}}$.
The claim now follows from the Lévy continuity theorem. □
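
The convergence of characteristic functions in this proof can be observed directly for the signed Bernoulli distribution, where $F_X(t) = \cos(t)$ by Example 2.2.3 and hence $F_{Z_n}(t) = \cos(t/\sqrt{n})^n$ by (2.36). The sketch below is an illustration only; the function names, the value of $t$ and the values of $n$ are arbitrary choices.

```python
import math

def char_fn_normalised_sum(t, n):
    """F_{Z_n}(t) = F_X(t / sqrt(n))**n with F_X(t) = cos(t) (signed Bernoulli)."""
    return math.cos(t / math.sqrt(n)) ** n

def gaussian_char_fn(t):
    """Characteristic function exp(-t^2/2) of N(0,1)."""
    return math.exp(-t * t / 2.0)

t = 1.5
errors = [abs(char_fn_normalised_sum(t, n) - gaussian_char_fn(t))
          for n in (10, 100, 1000, 10000)]
print(errors)
```

A further Taylor expansion of $\log\cos$ suggests the error decays like $1/n$ for fixed $t$ in this example.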

Exercise 2.2.13 (Vector-valued central limit theorem). Let
$X = (X_1, \dots, X_d)$ be a random variable taking values in $\mathbf{R}^d$ with
finite second moment. Define the covariance matrix $\Sigma(X)$ to be the
$d \times d$ matrix $\Sigma$ whose $ij$-th entry is the covariance
$\mathbf{E}(X_i - \mathbf{E}(X_i))(X_j - \mathbf{E}(X_j))$.

(i) Show that the covariance matrix is positive semi-definite
real symmetric.

(ii) Conversely, given any positive definite real symmetric $d \times d$
matrix $\Sigma$ and $\mu \in \mathbf{R}^d$, show that the normal distribution
$N(\mu, \Sigma)_{\mathbf{R}^d}$, given by the absolutely continuous measure

$$\frac{1}{((2\pi)^d \det \Sigma)^{1/2}}\, e^{-(x - \mu) \cdot \Sigma^{-1} (x - \mu)/2}\, dx,$$

has mean $\mu$ and covariance matrix $\Sigma$, and has a characteristic
function given by

$$F(t) = e^{i\mu \cdot t} e^{-t \cdot \Sigma t/2}.$$

How would one define the normal distribution $N(\mu, \Sigma)_{\mathbf{R}^d}$ if
$\Sigma$ degenerated to be merely positive semi-definite instead of
positive definite?

(iii) If $S_n := X_1 + \dots + X_n$ is the sum of $n$ iid copies of $X$,
show that $\frac{S_n - n\mu}{\sqrt{n}}$ converges in distribution to
$N(0, \Sigma(X))_{\mathbf{R}^d}$.

Exercise 2.2.14 (Complex central limit theorem). Let $X$ be a
complex random variable of mean $\mu \in \mathbf{C}$, whose real and imaginary
parts have variance $\sigma^2/2$ and covariance 0. Let
$X_1, \dots, X_n \equiv X$ be iid copies of $X$. Show that as $n \to \infty$, the
normalised sums (2.26) converge in distribution to the standard complex
Gaussian $N(0,1)_{\mathbf{C}}$.

Exercise 2.2.15 (Lindeberg central limit theorem). Let $X_1, X_2, \dots$
be a sequence of independent (but not necessarily identically
distributed) real random variables, normalised to have mean zero and
variance one. Assume the (strong) Lindeberg condition

$$\lim_{N \to \infty} \limsup_{n \to \infty} \frac{1}{n} \sum_{j=1}^n \mathbf{E} |X_{j,>N}|^2 = 0$$

where $X_{j,>N} := X_j \mathbf{I}(|X_j| > N)$ is the truncation of $X_j$ to large
values. Show that as $n \to \infty$, $\frac{X_1 + \dots + X_n}{\sqrt{n}}$ converges
in distribution to $N(0,1)_{\mathbf{R}}$. (Hint: modify the truncation
argument.)

A more sophisticated version of the Fourier- analytic method gives 
a more quantitative form of the central limit theorem, namely the 
Berry-Esseen theorem. 

Theorem 2.2.8 (Berry-Esseen theorem). Let $X$ have mean zero, unit
variance, and finite third moment. Let
$Z_n := (X_1 + \dots + X_n)/\sqrt{n}$, where $X_1, \dots, X_n$ are iid copies
of $X$. Then we have

$$\mathbf{P}(Z_n < a) = \mathbf{P}(G < a) + O\left(\frac{1}{\sqrt{n}} (\mathbf{E}|X|^3)\right) \tag{2.37}$$

uniformly for all $a \in \mathbf{R}$, where $G \equiv N(0,1)_{\mathbf{R}}$, and the implied
constant is absolute.
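
For signed Bernoulli variables (where $\mathbf{E}|X|^3 = 1$) the left-hand side of (2.37) can be computed exactly from binomial coefficients, so the $1/\sqrt{n}$ rate can be observed without simulation. The sketch below is an illustration under that assumption; the function names are hypothetical. Because the distribution function of $Z_n$ is a step function, the supremum over $a$ is attained in the limit at the atoms, so it suffices to compare the values just before and just after each jump.

```python
import math

def normal_cdf(a):
    """Distribution function of N(0,1), via the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def kolmogorov_distance(n):
    """sup_a |P(Z_n < a) - P(G < a)| for Z_n a normalised sum of n random signs."""
    worst, cdf = 0.0, 0.0
    for k in range(n + 1):
        a = (2 * k - n) / math.sqrt(n)       # atom of Z_n
        mass = math.comb(n, k) / 2.0 ** n    # P(Z_n = a)
        # Compare just before and just after the jump at a.
        worst = max(worst, abs(cdf - normal_cdf(a)), abs(cdf + mass - normal_cdf(a)))
        cdf += mass
    return worst

for n in (10, 100, 1000):
    print(n, kolmogorov_distance(n), kolmogorov_distance(n) * math.sqrt(n))
```

The last column stays bounded above and below as $n$ grows, consistent with the error term in (2.37) having the right order for this distribution.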



Proof. (Optional) Write $\varepsilon := \mathbf{E}|X|^3/\sqrt{n}$; our task is to
show that

$$\mathbf{P}(Z_n < a) = \mathbf{P}(G < a) + O(\varepsilon)$$

for all $a$. We may of course assume that $\varepsilon < 1$, as the claim is
trivial otherwise.

Let $c > 0$ be a small absolute constant to be chosen later. Let
$\eta : \mathbf{R} \to \mathbf{R}$ be a non-negative Schwartz function with total
mass 1 whose Fourier transform is supported in $[-c, c]$, and let
$\varphi : \mathbf{R} \to \mathbf{R}$ be the smoothed out version of $1_{(-\infty,0]}$,
defined as

$$\varphi(x) := \int_{\mathbf{R}} 1_{(-\infty,0]}(x - \varepsilon y)\, \eta(y)\, dy.$$

Observe that $\varphi$ is decreasing from 1 to 0.

We claim that it suffices to show that

$$\mathbf{E} \varphi(Z_n - a) = \mathbf{E} \varphi(G - a) + O_\eta(\varepsilon) \tag{2.38}$$

for every $a$, where the subscript means that the implied constant
depends on $\eta$. Indeed, suppose that (2.38) held. Define

$$\rho := \sup_a |\mathbf{P}(Z_n < a) - \mathbf{P}(G < a)| \tag{2.39}$$

thus our task is to show that $\rho = O(\varepsilon)$.

Let $a$ be arbitrary, and let $K > 0$ be a large absolute constant to
be chosen later. We write

$$\mathbf{P}(Z_n < a) \le \mathbf{E} \varphi(Z_n - a - K\varepsilon) + \mathbf{E}(1 - \varphi(Z_n - a - K\varepsilon)) \mathbf{I}(Z_n < a)$$

and thus by (2.38)

$$\mathbf{P}(Z_n < a) \le \mathbf{E} \varphi(G - a - K\varepsilon) + \mathbf{E}(1 - \varphi(Z_n - a - K\varepsilon)) \mathbf{I}(Z_n < a) + O_\eta(\varepsilon).$$

Meanwhile, from (2.39) and an integration by parts we see that

$$\mathbf{E}(1 - \varphi(Z_n - a - K\varepsilon)) \mathbf{I}(Z_n < a) = \mathbf{E}(1 - \varphi(G - a - K\varepsilon)) \mathbf{I}(G < a) + O((1 - \varphi(-K\varepsilon))\rho).$$

From the bounded density of $G$ and the rapid decrease of $\eta$ we have

$$\mathbf{E} \varphi(G - a - K\varepsilon) + \mathbf{E}(1 - \varphi(G - a - K\varepsilon)) \mathbf{I}(G < a) = \mathbf{P}(G < a) + O_{\eta,K}(\varepsilon).$$

Putting all this together, we see that

$$\mathbf{P}(Z_n < a) \le \mathbf{P}(G < a) + O_{\eta,K}(\varepsilon) + O((1 - \varphi(-K\varepsilon))\rho).$$

A similar argument gives a lower bound

$$\mathbf{P}(Z_n < a) \ge \mathbf{P}(G < a) - O_{\eta,K}(\varepsilon) - O(\varphi(K\varepsilon)\rho),$$

and so

$$|\mathbf{P}(Z_n < a) - \mathbf{P}(G < a)| \le O_{\eta,K}(\varepsilon) + O((1 - \varphi(-K\varepsilon))\rho) + O(\varphi(K\varepsilon)\rho).$$

Taking suprema over $a$, we obtain

$$\rho \le O_{\eta,K}(\varepsilon) + O((1 - \varphi(-K\varepsilon))\rho) + O(\varphi(K\varepsilon)\rho).$$

If $K$ is large enough (depending on $c$), we can make
$1 - \varphi(-K\varepsilon)$ and $\varphi(K\varepsilon)$ small, and thus absorb the latter two
terms on the right-hand side into the left-hand side. This gives the
desired bound $\rho = O(\varepsilon)$.

It remains to establish (2.38). Applying (2.33), it suffices to show
that

$$\left| \int_{\mathbf{R}} \hat\varphi(t) (F_{Z_n}(t) - F_G(t))\, dt \right| \le O(\varepsilon). \tag{2.40}$$

Now we estimate each of the various expressions. Standard
Fourier-analytic computations show that

$$\hat\varphi(t) = \hat 1_{(-\infty,a]}(t)\, \hat\eta(t/\varepsilon)$$

and that

$$\hat 1_{(-\infty,a]}(t) = O\left(\frac{1}{1 + |t|}\right).$$

Since $\hat\eta$ was supported in $[-c, c]$, it suffices to show that

$$\int_{|t| \le c/\varepsilon} \frac{|F_{Z_n}(t) - F_G(t)|}{1 + |t|}\, dt \le O(\varepsilon). \tag{2.41}$$

From Taylor expansion we have

$$e^{itX} = 1 + itX - \frac{t^2}{2} X^2 + O(|t|^3 |X|^3)$$

for any $t$; taking expectations and using the definition of $\varepsilon$ we
have

$$F_X(t) = 1 - t^2/2 + O(\varepsilon \sqrt{n} |t|^3)$$

and in particular

$$F_X(t) = \exp(-t^2/2 + O(\varepsilon \sqrt{n} |t|^3))$$

if $|t| \le c/\mathbf{E}|X|^3$ and $c$ is small enough. Applying (2.36), we
conclude that

$$F_{Z_n}(t) = \exp(-t^2/2 + O(\varepsilon |t|^3))$$

if $|t| \le c/\varepsilon$. Meanwhile, from Exercise 2.2.5 we have
$F_G(t) = \exp(-t^2/2)$. Elementary calculus then gives us

$$|F_{Z_n}(t) - F_G(t)| \le O(\varepsilon |t|^3 \exp(-t^2/4))$$

(say) if $c$ is small enough. Inserting this bound into (2.41) we obtain
the claim. □

Exercise 2.2.16. Show that the error terms here are sharp (up to
constants) when $X$ is a signed Bernoulli random variable.
2.2.3. The moment method. The above Fourier-analytic proof of
the central limit theorem is one of the quickest (and slickest) proofs 
available for this theorem, and is accordingly the "standard" proof 
given in probability textbooks. However, it relies quite heavily on the 
Fourier-analytic identities in Exercise 2.2.12, which in turn are ex- 
tremely dependent on both the commutative nature of the situation 
(as it uses the identity $e^{A+B} = e^A e^B$) and on the independence of the
situation (as it uses identities of the form $\mathbf{E}(e^A e^B) = (\mathbf{E}e^A)(\mathbf{E}e^B)$).
When we turn to random matrix theory, we will often lose (or be 
forced to modify) one or both of these properties, which often causes 
the Fourier-analytic methods to fail spectacularly. Because of this, 
it is also important to look for non-Fourier based methods to prove 
results such as the central limit theorem. These methods often lead to 
proofs that are lengthier and more technical than the Fourier proofs, 
but also tend to be more robust, and in particular can often be ex- 
tended to random matrix theory situations. Thus both the Fourier 
and non-Fourier proofs will be of importance in this text.

The most elementary (but still remarkably effective) method avail- 
able in this regard is the moment method, which we have already used 
in Section 2.1. This method seeks to understand the distribution of a
random variable $X$ via its moments $\mathbf{E}X^k$. In principle, this method is
equivalent to the Fourier method, through the identity (2.32); but in 
practice, the moment method proofs tend to look somewhat different 
than the Fourier-analytic ones, and it is often more apparent how to 
modify them to non-independent or non-commutative settings. 

We first need an analogue of the Lévy continuity theorem. Here
we encounter a technical issue: whereas the Fourier phases $x \mapsto e^{itx}$
were bounded, the moment functions $x \mapsto x^k$ become unbounded at
infinity. However, one can deal with this issue as long as one has 
sufficient decay: 

Theorem 2.2.9 (Carleman continuity theorem). Let $X_n$ be a
sequence of uniformly subgaussian real random variables, and let $X$ be
another subgaussian random variable. Then the following statements
are equivalent:

(i) For every $k = 0, 1, 2, \ldots$, $\mathbf{E}X_n^k$ converges pointwise to $\mathbf{E}X^k$.

(ii) $X_n$ converges in distribution to $X$.

Proof. We first show how (ii) implies (i). Let $N > 0$ be a truncation
parameter, and let $\varphi : \mathbf{R} \to \mathbf{R}$ be a smooth function that equals
1 on $[-1, 1]$ and vanishes outside of $[-2, 2]$. Then for any $k$, the
convergence in distribution implies that $\mathbf{E}X_n^k\varphi(X_n/N)$ converges to
$\mathbf{E}X^k\varphi(X/N)$. On the other hand, from the uniform subgaussian
hypothesis, one can make $\mathbf{E}X_n^k(1 - \varphi(X_n/N))$ and $\mathbf{E}X^k(1 - \varphi(X/N))$
arbitrarily small for fixed $k$ by making $N$ large enough. Summing,
and then letting $N$ go to infinity, we obtain (i).

Conversely, suppose (i) is true. From the uniform subgaussian
hypothesis, the $X_n$ have $(k+1)^{\mathrm{st}}$ moment bounded by $(Ck)^{k/2}$ for
all $k \ge 1$ and some $C$ independent of $k$ (see Exercise 1.1.4). From
Taylor's theorem with remainder (and Stirling's formula, Section 1.2)
we conclude

$$F_{X_n}(t) = \sum_{j=0}^{k} \frac{(it)^j}{j!}\mathbf{E}X_n^j + O((Ck)^{-k/2}|t|^{k+1})$$

uniformly in $t$ and $n$. Similarly for $X$. Taking limits using (i) we see
that

$$\limsup_{n \to \infty} |F_{X_n}(t) - F_X(t)| = O((Ck)^{-k/2}|t|^{k+1}).$$

Then letting $k \to \infty$, keeping $t$ fixed, we see that $F_{X_n}(t)$ converges
pointwise to $F_X(t)$ for each $t$, and the claim now follows from the
Lévy continuity theorem. □

Remark 2.2.10. One corollary of Theorem 2.2.9 is that the distri- 
bution of a subgaussian random variable is uniquely determined by 
its moments (actually, this could already be deduced from Exercise 
2.2.9 and Remark 2.2.5). The situation can fail for distributions with 
slower tails, for much the same reason that a smooth function is not 
determined by its derivatives at one point if that function is not an- 
alytic. 

The Fourier inversion formula provides an easy way to recover the 
distribution from the characteristic function. Recovering a distribu- 
tion from its moments is more difficult, and sometimes requires tools 
such as analytic continuation; this problem is known as the inverse 
moment problem and will not be discussed here. 
To prove the central limit theorem, we know from the truncation
method that we may assume without loss of generality that $X$ is
bounded (and in particular subgaussian); we may also normalise $X$
to have mean zero and unit variance. From the Chernoff bound (2.28)
we know that the $Z_n$ are uniformly subgaussian; so by Theorem 2.2.9,
it suffices to show that

$$\mathbf{E}Z_n^k \to \mathbf{E}G^k$$

for all $k = 0, 1, 2, \ldots$, where $G \equiv N(0,1)_{\mathbf{R}}$ is a standard Gaussian
variable.

The moments EG k are easy to compute: 

Exercise 2.2.17. Let $k$ be a natural number, and let $G \equiv N(0,1)_{\mathbf{R}}$.
Show that $\mathbf{E}G^k$ vanishes when $k$ is odd, and is equal to $\frac{k!}{2^{k/2}(k/2)!}$ when $k$
is even. (Hint: This can either be done directly by using the Gamma
function, or by using Exercise 2.2.5 and Exercise 2.2.9.)
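As a sanity check on the formula in Exercise 2.2.17 (not a solution to it), one can compare the closed form with a direct numerical integration against the Gaussian density. The sketch below is illustrative only; the function names, the cutoff and the step count are our own choices.

```python
import math

def gaussian_moment(k: int) -> int:
    """Closed form from Exercise 2.2.17: 0 for odd k,
    k! / (2^(k/2) (k/2)!) for even k."""
    if k % 2 == 1:
        return 0
    half = k // 2
    return math.factorial(k) // (2 ** half * math.factorial(half))

def gaussian_moment_numeric(k: int, cutoff: float = 12.0, steps: int = 20000) -> float:
    """Midpoint-rule approximation of the integral of x^k against the
    standard normal density over [-cutoff, cutoff] (cutoff is an assumption)."""
    h = 2 * cutoff / steps
    total = 0.0
    for j in range(steps):
        x = -cutoff + (j + 0.5) * h
        total += x ** k * math.exp(-x * x / 2)
    return total * h / math.sqrt(2 * math.pi)

for k in range(0, 9):
    print(k, gaussian_moment(k), round(gaussian_moment_numeric(k), 6))
```

The even moments produced this way are 1, 1, 3, 15, 105 for k = 0, 2, 4, 6, 8.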

So now we need to compute $\mathbf{E}Z_n^k$. Using (2.29) and linearity of
expectation, we can expand this as

$$\mathbf{E}Z_n^k = n^{-k/2} \sum_{1 \le i_1, \ldots, i_k \le n} \mathbf{E}X_{i_1} \cdots X_{i_k}.$$

To understand this expression, let us first look at some small values
of $k$.

(i) For k = 0, this expression is trivially 1. 

(ii) For k = 1, this expression is trivially 0, thanks to the mean 
zero hypothesis on X. 

(iii) For $k = 2$, we can split this expression into the diagonal and
off-diagonal components:

$$n^{-1} \sum_{1 \le i \le n} \mathbf{E}X_i^2 + n^{-1} \sum_{1 \le i < j \le n} \mathbf{E}2X_iX_j.$$

Each summand in the first sum is 1, as $X$ has unit variance.
Each summand in the second sum is 0, as the $X_i$ have mean
zero and are independent. So the second moment $\mathbf{E}Z_n^2$ is 1.
(iv) For $k = 3$, we have a similar expansion

$$n^{-3/2} \sum_{1 \le i \le n} \mathbf{E}X_i^3 + n^{-3/2} \sum_{1 \le i < j \le n} \mathbf{E}(3X_i^2X_j + 3X_iX_j^2) + n^{-3/2} \sum_{1 \le i < j < k \le n} \mathbf{E}6X_iX_jX_k.$$

The summands in the latter two sums vanish because of the
(joint) independence and mean zero hypotheses. The
summands in the first sum need not vanish, but are $O(1)$, so the
first term is $O(n^{-1/2})$, which is asymptotically negligible, so
the third moment $\mathbf{E}Z_n^3$ goes to 0.

(v) For $k = 4$, the expansion becomes quite complicated:

$$n^{-2} \sum_{1 \le i \le n} \mathbf{E}X_i^4 + n^{-2} \sum_{1 \le i < j \le n} \mathbf{E}(4X_i^3X_j + 6X_i^2X_j^2 + 4X_iX_j^3)$$
$$+ n^{-2} \sum_{1 \le i < j < k \le n} \mathbf{E}(12X_i^2X_jX_k + 12X_iX_j^2X_k + 12X_iX_jX_k^2)$$
$$+ n^{-2} \sum_{1 \le i < j < k < l \le n} \mathbf{E}24X_iX_jX_kX_l.$$

Again, most terms vanish, except for the first sum, which
is $O(n^{-1})$ and is asymptotically negligible, and the sum
$n^{-2} \sum_{1 \le i < j \le n} \mathbf{E}6X_i^2X_j^2$, which by the independence and
unit variance assumptions works out to $n^{-2} \cdot 6\binom{n}{2} = 3 + o(1)$.
Thus the fourth moment $\mathbf{E}Z_n^4$ goes to 3 (as it should).

Now we tackle the general case. Ordering the indices $i_1, \ldots, i_k$
as $j_1 < \ldots < j_m$ for some $1 \le m \le k$, with each $j_r$ occurring with
multiplicity $a_r \ge 1$ and using elementary enumerative combinatorics,
we see that $\mathbf{E}Z_n^k$ is the sum of all terms of the form

(2.42)  $$n^{-k/2} \sum_{1 \le j_1 < \ldots < j_m \le n} c_{k,a_1,\ldots,a_m} \mathbf{E}X_{j_1}^{a_1} \cdots X_{j_m}^{a_m}$$

where $1 \le m \le k$, $a_1, \ldots, a_m$ are positive integers adding up to $k$,
and $c_{k,a_1,\ldots,a_m}$ is the multinomial coefficient

$$c_{k,a_1,\ldots,a_m} := \frac{k!}{a_1! \cdots a_m!}.$$
The total number of such terms depends only on $k$. More
precisely, it is $2^{k-1}$ (exercise!), though we will not need this fact.

As we already saw from the small $k$ examples, most of the terms
vanish, and many of the other terms are negligible in the limit $n \to \infty$.
Indeed, if any of the $a_r$ are equal to 1, then every summand in (2.42)
vanishes, by joint independence and the mean zero hypothesis. Thus,
we may restrict attention to those expressions (2.42) for which all the
$a_r$ are at least 2. Since the $a_r$ sum up to $k$, we conclude that $m$ is at
most $k/2$.
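The bookkeeping behind (2.42) can be checked by brute force for small $k$. The following sketch (with helper names of our own choosing) enumerates all tuples $(a_1, \ldots, a_m)$ of positive integers summing to $k$, confirms that there are $2^{k-1}$ of them, and confirms that the multinomial coefficients, weighted by the number $\binom{n}{m}$ of choices of $j_1 < \ldots < j_m$, account for all $n^k$ index tuples $(i_1, \ldots, i_k)$.

```python
from math import comb, factorial, prod

def compositions(k):
    """Yield every tuple (a_1, ..., a_m) of positive integers with sum k."""
    if k == 0:
        yield ()
        return
    for first in range(1, k + 1):
        for rest in compositions(k - first):
            yield (first,) + rest

def multinomial(parts):
    """The coefficient k! / (a_1! ... a_m!) with k = a_1 + ... + a_m."""
    return factorial(sum(parts)) // prod(factorial(a) for a in parts)

def count_index_tuples(n, k):
    """Total number of (i_1, ..., i_k) in {1..n}^k, counted term by term
    as in (2.42): one multinomial coefficient per choice of j_1 < ... < j_m."""
    return sum(multinomial(a) * comb(n, len(a)) for a in compositions(k))

k = 6
surviving = [a for a in compositions(k) if min(a) >= 2]
print(len(list(compositions(k))), surviving)
```

For $k = 6$ the surviving multiplicity patterns (all $a_r \ge 2$) are (2,2,2), (2,4), (3,3), (4,2) and (6), each with $m \le k/2$ as claimed.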

On the other hand, the total number of summands in (2.42) is
clearly at most $n^m$ (in fact it is $\binom{n}{m}$), and the summands are bounded
(for fixed $k$) since $X$ is bounded. Thus, if $m$ is strictly less than $k/2$,
then the expression in (2.42) is $O(n^{m-k/2})$ and goes to zero as $n \to \infty$.
So, asymptotically, the only terms (2.42) which are still relevant are
those for which $m$ is equal to $k/2$. This already shows that $\mathbf{E}Z_n^k$ goes
to zero when $k$ is odd. When $k$ is even, the only surviving term in
the limit is now when $m = k/2$ and $a_1 = \ldots = a_m = 2$. But then by
independence and unit variance, the expectation in (2.42) is 1, and
so this term is equal to

$$n^{-k/2} \frac{k!}{2^{k/2}} \binom{n}{k/2} = \frac{k!}{2^{k/2}(k/2)!} + o(1),$$

and the main term is happily equal to the moment $\mathbf{E}G^k$ as computed
in Exercise 2.2.17.
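To see the convergence of moments concretely, here is a small sketch for the special case where $X$ is a signed Bernoulli variable (our choice of example; any bounded $X$ of mean zero and unit variance would do). In that case $X_1 + \cdots + X_n = 2B - n$ with $B$ binomial, so $\mathbf{E}Z_n^k$ can be computed exactly rather than simulated.

```python
from fractions import Fraction
from math import comb, factorial

def bernoulli_sum_moment(n: int, k: int) -> Fraction:
    """Exact E (X_1 + ... + X_n)^k for iid signed Bernoulli X_i,
    using that the sum equals 2B - n with B ~ Binomial(n, 1/2)."""
    total = sum(comb(n, j) * (2 * j - n) ** k for j in range(n + 1))
    return Fraction(total, 2 ** n)

def normalised_moment(n: int, k: int) -> float:
    """E Z_n^k = n^(-k/2) E (X_1 + ... + X_n)^k."""
    return float(bernoulli_sum_moment(n, k)) / n ** (k / 2)

def gaussian_moment(k: int) -> int:
    """Limit predicted by Exercise 2.2.17."""
    if k % 2 == 1:
        return 0
    return factorial(k) // (2 ** (k // 2) * factorial(k // 2))

for k in (2, 3, 4, 6):
    print(k, [round(normalised_moment(n, k), 4) for n in (10, 100, 1000)],
          gaussian_moment(k))
```

In this example the fourth moment is exactly $3 - 2/n$, so the $O(n^{-1})$ correction coming from the diagonal terms is visible directly.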

2.2.4. The Lindeberg swapping trick. The moment method proof 
of the central limit theorem that we just gave consisted of four steps: 

(i) (Truncation and normalisation step) A reduction to the case 
when X was bounded with zero mean and unit variance. 

(ii) (Inverse moment step) A reduction to a computation of
asymptotic moments $\lim_{n \to \infty} \mathbf{E}Z_n^k$.

(iii) (Analytic step) Showing that most terms in the expansion
of this asymptotic moment were zero, or went to zero as
$n \to \infty$.

(iv) (Algebraic step) Using enumerative combinatorics to com- 
pute the remaining terms in the expansion. 
In this particular case, the enumerative combinatorics was very
classical and easy - it was basically asking for the number of ways one
can place $k$ balls in $m$ boxes, so that the $r^{\mathrm{th}}$ box contains $a_r$ balls,
and the answer is well known to be given by the multinomial $\frac{k!}{a_1! \cdots a_m!}$.
By a small algebraic miracle, this result matched up nicely with the
computation of the moments of the Gaussian $N(0,1)_{\mathbf{R}}$.

However, when we apply the moment method to more advanced 
problems, the enumerative combinatorics can become more non-trivial, 
requiring a fair amount of combinatorial and algebraic computation. 
The algebraic miracle that occurs at the end of the argument can 
then seem like a very fortunate but inexplicable coincidence, making 
the argument somehow unsatisfying despite being rigorous. 

In [Li1922], Lindeberg observed that there was a very simple way
to decouple the algebraic miracle from the analytic computations, 
so that all relevant algebraic identities only need to be verified in 
the special case of gaussian random variables, in which everything is 
much easier to compute. This Lindeberg swapping trick (or Lindeberg 
replacement trick) will be very useful in the later theory of random 
matrices, so we pause to give it here in the simple context of the 
central limit theorem. 

The basic idea is as follows. We repeat the truncation-and-normalisation
and inverse moment steps in the preceding argument. Thus, $X_1, \ldots, X_n$
are iid copies of a bounded real random variable $X$ of mean zero
and unit variance, and we wish to show that $\mathbf{E}Z_n^k \to \mathbf{E}G^k$, where
$G \equiv N(0,1)_{\mathbf{R}}$ and $k \ge 0$ is fixed.

Now let $Y_1, \ldots, Y_n$ be iid copies of the Gaussian itself:
$Y_1, \ldots, Y_n \equiv N(0,1)_{\mathbf{R}}$. Because the sum of independent Gaussians is again a
Gaussian (Exercise 2.1.9), we see that the random variable

$$W_n := \frac{Y_1 + \cdots + Y_n}{\sqrt{n}}$$

already has the same distribution as $G$: $W_n \equiv G$. Thus, it suffices to
show that

$$\mathbf{E}Z_n^k = \mathbf{E}W_n^k + o(1).$$

Now we perform the analysis part of the moment method argument
again. We can expand $\mathbf{E}Z_n^k$ into terms (2.42) as before, and discard
all terms except for the $a_1 = \ldots = a_m = 2$ term as being $o(1)$.
Similarly, we can expand $\mathbf{E}W_n^k$ into very similar terms (but with the
$X_i$ replaced by $Y_i$) and again discard all but the $a_1 = \ldots = a_m = 2$ term.

But by hypothesis, the second moments of $X$ and $Y$ match:
$\mathbf{E}X^2 = \mathbf{E}Y^2 = 1$. Thus, by joint independence, the
$a_1 = \ldots = a_m = 2$ term (2.42) for $X$ is exactly equal to that of $Y$. And the
claim follows.

This is almost exactly the same proof as in the previous section,
but note that we did not need to compute the multinomial coefficient
$c_{k,a_1,\ldots,a_m}$, nor did we need to verify the miracle that this coefficient
matched (up to normalising factors) to the moments of the Gaussian.
Instead, we used the much more mundane "miracle" that the sum of
independent Gaussians was again a Gaussian.

To put it another way, the Lindeberg replacement trick factors a 
universal limit theorem, such as the central limit theorem, into two 
components: 

(i) A universality or invariance result, which shows that the
distribution (or other statistics, such as moments) of some
random variable $F(X_1, \ldots, X_n)$ is asymptotically unchanged
in the limit $n \to \infty$ if each of the input variables $X_i$ is
replaced by a Gaussian substitute $Y_i$; and

(ii) The gaussian case, which computes the asymptotic
distribution (or other statistic) of $F(Y_1, \ldots, Y_n)$ in the case when
$Y_1, \ldots, Y_n$ are all Gaussians.

The former type of result tends to be entirely analytic in nature
(basically, one just needs to show that all error terms that show up when
swapping $X$ with $Y$ add up to $o(1)$), while the latter type of result
tends to be entirely algebraic in nature (basically, one just needs to
exploit the many pleasant algebraic properties of Gaussians). This
decoupling of the analysis and algebra steps tends to simplify the
argument both at a technical level and at a conceptual level, as each
step then becomes easier to understand than the whole.

2.2.5. Individual swapping. In the above argument, we swapped
all the original input variables $X_1, \ldots, X_n$ with Gaussians $Y_1, \ldots, Y_n$
en masse. There is also a variant of the Lindeberg trick in which the
swapping is done individually. To illustrate the individual swapping
method, let us use it to show the following weak version of the
Berry-Esseen theorem:

Theorem 2.2.11 (Berry-Esseen theorem, weak form). Let $X$ have
mean zero, unit variance, and finite third moment, and let $\varphi$ be
smooth with uniformly bounded derivatives up to third order. Let
$Z_n := (X_1 + \cdots + X_n)/\sqrt{n}$, where $X_1, \ldots, X_n$ are iid copies of $X$.
Then we have

(2.43)  $$\mathbf{E}\varphi(Z_n) = \mathbf{E}\varphi(G) + O\left(\frac{1}{\sqrt{n}}(\mathbf{E}|X|^3) \sup_{x \in \mathbf{R}} |\varphi'''(x)|\right)$$

where $G \equiv N(0,1)_{\mathbf{R}}$.

Proof. Let $Y_1, \ldots, Y_n$ and $W_n$ be as in the previous section. As $W_n \equiv G$,
it suffices to show that

$$\mathbf{E}\varphi(Z_n) - \varphi(W_n) = o(1).$$

We telescope this (using linearity of expectation) as

$$\mathbf{E}\varphi(Z_n) - \varphi(W_n) = -\sum_{i=0}^{n-1} \mathbf{E}\varphi(Z_{n,i}) - \varphi(Z_{n,i+1})$$

where

$$Z_{n,i} := \frac{X_1 + \cdots + X_i + Y_{i+1} + \cdots + Y_n}{\sqrt{n}}$$

is a partially swapped version of $Z_n$. So it will suffice to show that

$$\mathbf{E}\varphi(Z_{n,i}) - \varphi(Z_{n,i+1}) = O((\mathbf{E}|X|^3) \sup_{x \in \mathbf{R}} |\varphi'''(x)| / n^{3/2})$$

uniformly for $0 \le i < n$.

We can write $Z_{n,i} = S_{n,i} + Y_{i+1}/\sqrt{n}$ and $Z_{n,i+1} = S_{n,i} + X_{i+1}/\sqrt{n}$,
where

(2.44)  $$S_{n,i} := \frac{X_1 + \cdots + X_i + Y_{i+2} + \cdots + Y_n}{\sqrt{n}}.$$

To exploit this, we use Taylor expansion with remainder to write

$$\varphi(Z_{n,i}) = \varphi(S_{n,i}) + \varphi'(S_{n,i})Y_{i+1}/\sqrt{n} + \frac{1}{2}\varphi''(S_{n,i})Y_{i+1}^2/n + O(|Y_{i+1}|^3/n^{3/2} \sup_{x \in \mathbf{R}} |\varphi'''(x)|)$$

and

$$\varphi(Z_{n,i+1}) = \varphi(S_{n,i}) + \varphi'(S_{n,i})X_{i+1}/\sqrt{n} + \frac{1}{2}\varphi''(S_{n,i})X_{i+1}^2/n + O(|X_{i+1}|^3/n^{3/2} \sup_{x \in \mathbf{R}} |\varphi'''(x)|)$$

where the implied constants depend on $\varphi$ but not on $n$. Now, by
construction, the moments of $X_{i+1}$ and $Y_{i+1}$ match to second order,
and both variables are independent of $S_{n,i}$, thus

$$\mathbf{E}\varphi(Z_{n,i}) - \varphi(Z_{n,i+1}) = O(\mathbf{E}|Y_{i+1}|^3 \sup_{x \in \mathbf{R}} |\varphi'''(x)| / n^{3/2}) + O(\mathbf{E}|X_{i+1}|^3 \sup_{x \in \mathbf{R}} |\varphi'''(x)| / n^{3/2}),$$

and the claim follows [11]. □
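As a quick numerical illustration of (2.43), consider the test function $\varphi = \cos$ and signed Bernoulli $X$ (both our own choices). Then $\mathbf{E}\cos(Z_n)$ is the real part of the characteristic function of $Z_n$ at $t = 1$, namely $\cos(1/\sqrt{n})^n$, while $\mathbf{E}\cos(G) = e^{-1/2}$. The sketch below compares the gap with $1/\sqrt{n}$; the constant 1 in front of $1/\sqrt{n}$ is an assumption made for the comparison, not a value taken from the theorem.

```python
import math

# E cos(G) for a standard Gaussian G (real part of exp(-t^2/2) at t = 1).
EXPECTED_COS_GAUSSIAN = math.exp(-0.5)

def expected_cos_bernoulli(n: int) -> float:
    """E cos(Z_n) for iid signed Bernoulli X_i: the characteristic
    function of Z_n at t = 1 is cos(1/sqrt(n))**n, which is real."""
    return math.cos(1 / math.sqrt(n)) ** n

def swap_error(n: int) -> float:
    """|E cos(Z_n) - E cos(G)|, the left-hand side error in (2.43)."""
    return abs(expected_cos_bernoulli(n) - EXPECTED_COS_GAUSSIAN)

for n in (1, 10, 100, 1000, 10000):
    print(n, swap_error(n), 1 / math.sqrt(n))
```

In this symmetric example the observed error decays roughly like $1/n$, faster than the general $1/\sqrt{n}$ guarantee; this is consistent with the fact that the third moment of a signed Bernoulli variable also agrees with that of the Gaussian.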

Remark 2.2.12. The above argument relied on Taylor expansion,
and the hypothesis that the moments of $X$ and $Y$ matched to second
order. It is not hard to see that if we assume more moments matching
(e.g. $\mathbf{E}X^3 = \mathbf{E}Y^3 = 0$), and more smoothness on $\varphi$, then we
can improve the $\frac{1}{\sqrt{n}}$ factor on the right-hand side. Thus we see that
we expect swapping methods to become more powerful when more
moments are matching. We will see this when we discuss the four
moment theorem of Van Vu and myself in later lectures, which (very)
roughly speaking asserts that the spectral statistics of two random
matrices are asymptotically indistinguishable if their coefficients have
matching moments to fourth order.

Theorem 2.2.11 is easily implied by Theorem 2.2.8 and an
integration by parts. In the reverse direction, let us see what Theorem
2.2.11 tells us about the cumulative distribution function

$$\mathbf{P}(Z_n < a)$$

of $Z_n$. For any $\varepsilon > 0$, one can upper bound this expression by

$$\mathbf{E}\varphi(Z_n)$$

where $\varphi$ is a smooth function equal to 1 on $(-\infty, a]$ that vanishes
outside of $(-\infty, a + \varepsilon]$, and has third derivative $O(\varepsilon^{-3})$. By Theorem
2.2.11, we thus have

$$\mathbf{P}(Z_n < a) \le \mathbf{E}\varphi(G) + O\left(\frac{1}{\sqrt{n}}(\mathbf{E}|X|^3)\varepsilon^{-3}\right).$$

[11] Note from Hölder's inequality that $\mathbf{E}|X|^3 \ge 1$.

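On the other hand, as $G$ has a bounded probability density function,
we have

$$\mathbf{E}\varphi(G) = \mathbf{P}(G < a) + O(\varepsilon)$$

and so

$$\mathbf{P}(Z_n < a) \le \mathbf{P}(G < a) + O(\varepsilon) + O\left(\frac{1}{\sqrt{n}}(\mathbf{E}|X|^3)\varepsilon^{-3}\right).$$

A very similar argument gives the matching lower bound, thus

$$\mathbf{P}(Z_n < a) = \mathbf{P}(G < a) + O(\varepsilon) + O\left(\frac{1}{\sqrt{n}}(\mathbf{E}|X|^3)\varepsilon^{-3}\right).$$

Optimising in $\varepsilon$ we conclude that

(2.45)  $$\mathbf{P}(Z_n < a) = \mathbf{P}(G < a) + O\left(\frac{1}{\sqrt{n}}(\mathbf{E}|X|^3)\right)^{1/4}.$$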
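The optimisation in $\varepsilon$ leading to (2.45) can be spelled out as follows; the abbreviation $\delta$ is introduced here purely for convenience and does not appear elsewhere in the text.

```latex
% Write \delta := \frac{1}{\sqrt{n}} \mathbf{E}|X|^3 for the small parameter.
% The total error is of the form
\[
  O(\varepsilon) + O(\delta \varepsilon^{-3}).
\]
% The first term increases with \varepsilon and the second decreases, so up to
% constants the sum is smallest when the two terms balance:
\[
  \varepsilon = \delta \varepsilon^{-3}
  \quad\Longleftrightarrow\quad
  \varepsilon = \delta^{1/4}.
\]
% With this choice both terms are O(\delta^{1/4}), which gives
\[
  \mathbf{P}(Z_n < a) = \mathbf{P}(G < a)
    + O\Bigl( \bigl( \tfrac{1}{\sqrt{n}} \mathbf{E}|X|^3 \bigr)^{1/4} \Bigr).
\]
```

In particular the error is of size $n^{-1/8}$ in $n$, compared with $n^{-1/2}$ in Theorem 2.2.8.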

Comparing this with Theorem 2.2.8 we see that the decay exponent in 
n in the error term has degraded by a factor of 1/4. In our applications 
to random matrices, this type of degradation is acceptable, and so the 
swapping argument is a reasonable substitute for the Fourier-analytic 
one in this case. Also, this method is quite robust, and in particular 
extends well to higher dimensions; we will return to this point in 
later lectures, but see for instance [TaVuKr2010, Appendix D] for 
an example of a multidimensional Berry-Esseen theorem proven by
this method. 

On the other hand there is another method that can recover this 
loss while still avoiding Fourier-analytic techniques; we turn to this
topic next. 

2.2.6. Stein's method. Stein's method, introduced by Charles Stein [St1970],
is a powerful method to show convergence in distribution to a spe- 
cial distribution, such as the Gaussian. In several recent papers, this 
method has been used to control several expressions of interest in 
random matrix theory (e.g. the distribution of moments, or of the 
Stieltjes transform.) We will not use Stein's method in this text, but 
the method is of independent interest nonetheless. 
The probability density function $\rho(x) := \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ of the
standard normal distribution $N(0,1)_{\mathbf{R}}$ can be viewed as a solution to the
ordinary differential equation

(2.46)  $$\rho'(x) + x\rho(x) = 0.$$

One can take adjoints of this, and conclude (after an integration by
parts) that $\rho$ obeys the integral identity

$$\int_{\mathbf{R}} \rho(x)(f'(x) - xf(x))\,dx = 0$$

for any continuously differentiable $f$ with both $f$ and $f'$ bounded
(one can relax these assumptions somewhat). To put it another way,
if $G \equiv N(0,1)$, then we have

(2.47)  $$\mathbf{E}f'(G) - Gf(G) = 0$$

whenever $f$ is continuously differentiable with $f$, $f'$ both bounded.
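The identity (2.47) is easy to test numerically. The sketch below (function names, the integration cutoff and the test functions $f = \sin$ and $f = \tanh$ are our own choices) approximates $\mathbf{E}f'(G) - Gf(G)$ by a midpoint rule, and for contrast evaluates the same expression for a signed Bernoulli variable, where it does not vanish.

```python
import math

def gaussian_expectation(g, cutoff: float = 12.0, steps: int = 20000) -> float:
    """Midpoint-rule approximation of E g(G) for G = N(0,1),
    truncating the integral to [-cutoff, cutoff]."""
    h = 2 * cutoff / steps
    total = 0.0
    for j in range(steps):
        x = -cutoff + (j + 0.5) * h
        total += g(x) * math.exp(-x * x / 2)
    return total * h / math.sqrt(2 * math.pi)

def stein_defect_gaussian(f, f_prime) -> float:
    """E f'(G) - G f(G); by (2.47) this should vanish."""
    return gaussian_expectation(lambda x: f_prime(x) - x * f(x))

def stein_defect_signed_bernoulli(f, f_prime) -> float:
    """E f'(X) - X f(X) for X = +1 or -1 with probability 1/2 each."""
    return 0.5 * sum(f_prime(x) - x * f(x) for x in (-1.0, 1.0))

print(stein_defect_gaussian(math.sin, math.cos))
print(stein_defect_signed_bernoulli(math.sin, math.cos))
```

For $f = \sin$ the Bernoulli defect equals $\cos 1 - \sin 1$, which is visibly non-zero, in line with the converse statement that only the Gaussian satisfies the identity for all admissible $f$.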

It turns out that the converse is true: if $X$ is a real random
variable with the property that

$$\mathbf{E}f'(X) - Xf(X) = 0$$

whenever $f$ is continuously differentiable with $f$, $f'$ both bounded,
then $X$ is Gaussian. In fact, more is true, in the spirit of Theorem
2.2.4 and Theorem 2.2.9:

Theorem 2.2.13 (Stein continuity theorem). Let $X_n$ be a sequence
of real random variables with uniformly bounded second moment, and
let $G \equiv N(0,1)$. Then the following are equivalent:

(i) $\mathbf{E}f'(X_n) - X_nf(X_n)$ converges to zero whenever $f : \mathbf{R} \to \mathbf{R}$
is continuously differentiable with $f$, $f'$ both bounded.

(ii) $X_n$ converges in distribution to $G$.

Proof. To show that (ii) implies (i), it is not difficult to use the
uniform bounded second moment hypothesis and a truncation argument
to show that $\mathbf{E}f'(X_n) - X_nf(X_n)$ converges to $\mathbf{E}f'(G) - Gf(G)$ when
$f$ is continuously differentiable with $f$, $f'$ both bounded, and the claim
then follows from (2.47).

Now we establish the converse. It suffices to show that

$$\mathbf{E}\varphi(X_n) - \mathbf{E}\varphi(G) \to 0$$

whenever $\varphi : \mathbf{R} \to \mathbf{R}$ is a bounded continuous function. We may
normalise $\varphi$ to be bounded in magnitude by 1.

Trivially, the function $\varphi(\cdot) - \mathbf{E}\varphi(G)$ has zero expectation when
one substitutes $G$ for the argument $\cdot$, thus

(2.48)  $$\frac{1}{\sqrt{2\pi}} \int_{\mathbf{R}} e^{-y^2/2}(\varphi(y) - \mathbf{E}\varphi(G))\,dy = 0.$$

Comparing this with (2.47), one may thus hope to find a
representation of the form

(2.49)  $$\varphi(x) - \mathbf{E}\varphi(G) = f'(x) - xf(x)$$

for some continuously differentiable $f$ with $f$, $f'$ both bounded. This
is a simple ODE and can be easily solved (by the method of integrating
factors) to give a solution $f$, namely

(2.50)  $$f(x) := e^{x^2/2} \int_{-\infty}^{x} e^{-y^2/2}(\varphi(y) - \mathbf{E}\varphi(G))\,dy.$$

(One could dub $f$ the Stein transform of $\varphi$, although this term does
not seem to be in widespread use.) By the fundamental theorem
of calculus, $f$ is continuously differentiable and solves (2.49). Using
(2.48), we may also write $f$ as

(2.51)  $$f(x) := -e^{x^2/2} \int_{x}^{\infty} e^{-y^2/2}(\varphi(y) - \mathbf{E}\varphi(G))\,dy.$$

By completing the square, we see that $e^{-y^2/2} \le e^{-x^2/2}e^{-x(y-x)}$.
Inserting this into (2.50) and using the bounded nature of $\varphi$, we
conclude that $f(x) = O_\varphi(1/|x|)$ for $x < -1$; inserting it instead into
(2.51), we have $f(x) = O_\varphi(1/|x|)$ for $x > 1$. Finally, easy estimates
give $f(x) = O_\varphi(1)$ for $|x| \le 1$. Thus for all $x$ we have

$$f(x) = O_\varphi\left(\frac{1}{1+|x|}\right)$$

which when inserted back into (2.49) gives the boundedness of $f'$ (and
also of course gives the boundedness of $f$). In fact, if we rewrite (2.51)
as

$$f(x) := -\int_{0}^{\infty} e^{-s^2/2}e^{-sx}(\varphi(x+s) - \mathbf{E}\varphi(G))\,ds,$$

we see on differentiation under the integral sign (and using the
Lipschitz nature of $\varphi$) that $f'(x) = O_\varphi(1/x)$ for $x > 1$; a similar
manipulation (starting from (2.50)) applies for $x < -1$, and we in fact
conclude that $f'(x) = O_\varphi\left(\frac{1}{1+|x|}\right)$ for all $x$.

Applying (2.49) with $x = X_n$ and taking expectations, we have

$$\mathbf{E}\varphi(X_n) - \mathbf{E}\varphi(G) = \mathbf{E}f'(X_n) - X_nf(X_n).$$

By the hypothesis (i), the right-hand side goes to zero, hence the
left-hand side does also, and the claim follows. □
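To make the Stein transform (2.50) concrete, the following sketch evaluates it numerically for the particular test function $\varphi = \cos$ (for which $\mathbf{E}\varphi(G) = e^{-1/2}$) and checks by a finite difference that it solves (2.49). The choice of $\varphi$, the quadrature rule, the lower cutoff and the step sizes are all assumptions made for illustration; formula (2.50) is also numerically delicate for large positive $x$, where (2.51) would be the better starting point.

```python
import math

E_PHI_G = math.exp(-0.5)  # E cos(G) for G = N(0,1)

def phi(x: float) -> float:
    """Test function; an illustrative choice."""
    return math.cos(x)

def stein_transform(x: float, lower: float = -12.0, steps: int = 20000) -> float:
    """Approximate f(x) from (2.50) with Simpson's rule on [lower, x].
    `steps` must be even; the tail below `lower` is neglected."""
    h = (x - lower) / steps
    def integrand(y: float) -> float:
        return math.exp(-y * y / 2) * (phi(y) - E_PHI_G)
    total = integrand(lower) + integrand(x)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * integrand(lower + j * h)
    return math.exp(x * x / 2) * total * h / 3

def stein_residual(x: float, h: float = 1e-4) -> float:
    """f'(x) - x f(x) - (phi(x) - E phi(G)), with f' from a central
    difference; this should be close to zero if (2.49) holds."""
    derivative = (stein_transform(x + h) - stein_transform(x - h)) / (2 * h)
    return derivative - x * stein_transform(x) - (phi(x) - E_PHI_G)

for x in (-1.5, 0.0, 0.7, 2.0):
    print(x, stein_transform(x), stein_residual(x))
```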

The above theorem gave only a qualitative result (convergence 
in distribution), but the proof is quite quantitative, and can be used 
in particular to give Berry-Esseen type results. To illustrate this, 
we begin with a strengthening of Theorem 2.2.11 that reduces the 
number of derivatives of $\varphi$ that need to be controlled:

Theorem 2.2.14 (Berry-Esseen theorem, less weak form). Let $X$
have mean zero, unit variance, and finite third moment, and let $\varphi$
be smooth, bounded in magnitude by 1, and Lipschitz. Let
$Z_n := (X_1 + \cdots + X_n)/\sqrt{n}$, where $X_1, \ldots, X_n$ are iid copies of $X$. Then we
have

(2.52)  $$\mathbf{E}\varphi(Z_n) = \mathbf{E}\varphi(G) + O\left(\frac{1}{\sqrt{n}}(\mathbf{E}|X|^3)(1 + \sup_{x \in \mathbf{R}} |\varphi'(x)|)\right)$$

where $G \equiv N(0,1)_{\mathbf{R}}$.

Proof. Set $A := 1 + \sup_{x \in \mathbf{R}} |\varphi'(x)|$.

Let $f$ be the Stein transform (2.50) of $\varphi$, then by (2.49) we have

$$\mathbf{E}\varphi(Z_n) - \mathbf{E}\varphi(G) = \mathbf{E}f'(Z_n) - Z_nf(Z_n).$$

We expand $Z_nf(Z_n) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_if(Z_n)$. For each $i$, we then split
$Z_n = Z_{n;i} + \frac{1}{\sqrt{n}}X_i$, where
$Z_{n;i} := (X_1 + \cdots + X_{i-1} + X_{i+1} + \cdots + X_n)/\sqrt{n}$
(cf. (2.44)). By the fundamental theorem of calculus, we
have

$$\mathbf{E}X_if(Z_n) = \mathbf{E}X_if(Z_{n;i}) + \frac{1}{\sqrt{n}}X_i^2f'\left(Z_{n;i} + \frac{t}{\sqrt{n}}X_i\right)$$

where $t$ is uniformly distributed in $[0, 1]$ and independent of all of the
$X_1, \ldots, X_n$. Now observe that $X_i$ and $Z_{n;i}$ are independent, and $X_i$
has mean zero, so the first term on the right-hand side vanishes. Thus

(2.53)  $$\mathbf{E}\varphi(Z_n) - \mathbf{E}\varphi(G) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{E}f'(Z_n) - X_i^2f'\left(Z_{n;i} + \frac{t}{\sqrt{n}}X_i\right).$$

Another application of independence gives

$$\mathbf{E}f'(Z_{n;i}) = \mathbf{E}X_i^2f'(Z_{n;i})$$

so we may rewrite (2.53) as

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{E}(f'(Z_n) - f'(Z_{n;i})) - X_i^2\left(f'\left(Z_{n;i} + \frac{t}{\sqrt{n}}X_i\right) - f'(Z_{n;i})\right).$$

Recall from the proof of Theorem 2.2.13 that $f(x) = O(1/(1 + |x|))$
and $f'(x) = O(A/(1 + |x|))$. By the product rule, this implies that
$xf(x)$ has a Lipschitz constant of $O(A)$. Applying (2.49) and the
definition of $A$, we conclude that $f'$ has a Lipschitz constant of $O(A)$.
Thus we can bound the previous expression as

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{E}\frac{1}{\sqrt{n}}O(A|X_i| + A|X_i|^3)$$

and the claim follows from Hölder's inequality. □

This improvement already partially restores the exponent in (2.45)
from 1/4 to 1/2. But one can do better still by pushing the arguments
further. Let us illustrate this in the model case when the $X_i$ not only
have bounded third moment, but are in fact bounded:

Theorem 2.2.15 (Berry-Esseen theorem, bounded case). Let $X$ have
mean zero, unit variance, and be bounded by $O(1)$. Let
$Z_n := (X_1 + \cdots + X_n)/\sqrt{n}$, where $X_1, \ldots, X_n$ are iid copies of $X$. Then we have

(2.54)  $$\mathbf{P}(Z_n < a) = \mathbf{P}(G < a) + O\left(\frac{1}{\sqrt{n}}\right)$$

whenever $a = O(1)$, where $G \equiv N(0,1)_{\mathbf{R}}$.

Proof. Write $\phi := 1_{(-\infty,a]}$, thus we seek to show that

$$\mathbf{E}\phi(Z_n) - \phi(G) = O\left(\frac{1}{\sqrt{n}}\right).$$

Let $f$ be the Stein transform (2.50) of $\phi$. $\phi$ is not continuous, but it is
not difficult to see (e.g. by a limiting argument) that we still have the
estimates $f(x) = O(1/(1 + |x|))$ and $f'(x) = O(1)$ (in a weak sense),
and that $xf$ has a Lipschitz norm of $O(1)$ (here we use the hypothesis
$a = O(1)$). A similar limiting argument gives

$$\mathbf{E}\phi(Z_n) - \phi(G) = \mathbf{E}f'(Z_n) - Z_nf(Z_n)$$

and by arguing as in the proof of Theorem 2.2.14, we can write the
right-hand side as

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{E}(f'(Z_n) - f'(Z_{n;i})) - X_i^2\left(f'\left(Z_{n;i} + \frac{t}{\sqrt{n}}X_i\right) - f'(Z_{n;i})\right).$$

From (2.49), $f'$ is equal to $\phi$, plus a function with Lipschitz norm
$O(1)$. Thus, we can write the above expression as

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{E}(\phi(Z_n) - \phi(Z_{n;i})) - X_i^2\left(\phi\left(Z_{n;i} + \frac{t}{\sqrt{n}}X_i\right) - \phi(Z_{n;i})\right) + O(1/\sqrt{n}).$$

The $\phi(Z_{n;i})$ terms cancel (due to the independence of $X_i$ and $Z_{n;i}$,
and the normalised mean and variance of $X_i$), so we can simplify this
as

$$\mathbf{E}\phi(Z_n) - \frac{1}{n} \sum_{i=1}^{n} \mathbf{E}X_i^2\phi\left(Z_{n;i} + \frac{t}{\sqrt{n}}X_i\right) + O(1/\sqrt{n})$$

and so we conclude that

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{E}X_i^2\phi\left(Z_{n;i} + \frac{t}{\sqrt{n}}X_i\right) = \mathbf{E}\phi(G) + O(1/\sqrt{n}).$$

Since $t$ and $X_i$ are bounded, and $\phi$ is non-increasing, we have

$$\phi(Z_{n;i} + O(1/\sqrt{n})) \le \phi\left(Z_{n;i} + \frac{t}{\sqrt{n}}X_i\right) \le \phi(Z_{n;i} - O(1/\sqrt{n}));$$

applying the second inequality and using independence to once again
eliminate the $X_i^2$ factor, we see that

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{E}\phi(Z_{n;i} - O(1/\sqrt{n})) \ge \mathbf{E}\phi(G) + O(1/\sqrt{n})$$

which implies (by another appeal to the non-increasing nature of $\phi$
and the bounded nature of $X_i$) that

$$\mathbf{E}\phi(Z_n - O(1/\sqrt{n})) \ge \mathbf{E}\phi(G) + O(1/\sqrt{n})$$
or in other words that

$$\mathbf{P}(Z_n \le a + O(1/\sqrt{n})) \ge \mathbf{P}(G \le a) + O(1/\sqrt{n}).$$

Similarly, using the lower bound inequalities, one has

$$\mathbf{P}(Z_n \le a - O(1/\sqrt{n})) \le \mathbf{P}(G \le a) + O(1/\sqrt{n}).$$


Moving $a$ up and down by $O(1/\sqrt{n})$, and using the bounded density
of $G$, we obtain the claim. □

Actually, one can use Stein's method to obtain the full Berry-
Esseen theorem, but the computations get somewhat technical,
requiring an induction on $n$ to deal with the contribution of the
exceptionally large values of $X_i$; see [BaHa1984].

2.2.7. Predecessor comparison. Suppose one had never heard of
the normal distribution, but one still suspected the existence of the
central limit theorem - thus, one thought that the sequence $Z_n$ of
normalised distributions was converging in distribution to something,
but was unsure what the limit was. Could one still work out what
that limit was?

Certainly in the case of Bernoulli distributions, one could work
explicitly using Stirling's formula (see Exercise 2.2.2), and the Fourier-
analytic method would also eventually work. Let us now give a third
way to (heuristically) derive the normal distribution as the limit of the
central limit theorem. The idea is to compare $Z_n$ with its predecessor
$Z_{n-1}$, using the recursive formula

(2.55)  $$Z_n = \frac{\sqrt{n-1}}{\sqrt{n}}Z_{n-1} + \frac{1}{\sqrt{n}}X_n$$

(normalising $X_n$ to have mean zero and unit variance as usual; let us
also truncate $X_n$ to be bounded, for simplicity). Let us hypothesise
that $Z_n$ and $Z_{n-1}$ have approximately the same distribution; let us
also conjecture that this distribution is absolutely continuous, given
as $\rho(x)\,dx$ for some smooth $\rho(x)$. (If we secretly knew the central
limit theorem, we would know that $\rho(x)$ is in fact $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$, but let
us pretend that we did not yet know this fact.) Thus, for any test
function $\varphi$, we expect

(2.56)  $$\mathbf{E}\varphi(Z_n) \approx \mathbf{E}\varphi(Z_{n-1}) \approx \int_{\mathbf{R}} \varphi(x)\rho(x)\,dx.$$

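Now let us try to combine this with (2.55). We assume $\varphi$ to be
smooth, and Taylor expand to third order:

$$\varphi(Z_n) = \varphi\left(\frac{\sqrt{n-1}}{\sqrt{n}}Z_{n-1}\right) + \frac{1}{\sqrt{n}}X_n\varphi'\left(\frac{\sqrt{n-1}}{\sqrt{n}}Z_{n-1}\right) + \frac{1}{2n}X_n^2\varphi''\left(\frac{\sqrt{n-1}}{\sqrt{n}}Z_{n-1}\right) + O\left(\frac{1}{n^{3/2}}\right).$$

Taking expectations, and using the independence of $X_n$ and $Z_{n-1}$,
together with the normalisations on $X_n$, we obtain

$$\mathbf{E}\varphi(Z_n) = \mathbf{E}\varphi\left(\frac{\sqrt{n-1}}{\sqrt{n}}Z_{n-1}\right) + \frac{1}{2n}\mathbf{E}\varphi''\left(\frac{\sqrt{n-1}}{\sqrt{n}}Z_{n-1}\right) + O\left(\frac{1}{n^{3/2}}\right).$$

Up to errors of $O(\frac{1}{n^{3/2}})$, one can approximate the second term here
by $\frac{1}{2n}\mathbf{E}\varphi''(Z_{n-1})$. We then insert (2.56) and are led to the heuristic
equation

$$\int_{\mathbf{R}} \varphi(x)\rho(x) \approx \int_{\mathbf{R}} \varphi\left(\frac{\sqrt{n-1}}{\sqrt{n}}x\right)\rho(x) + \frac{1}{2n}\varphi''(x)\rho(x)\,dx + O\left(\frac{1}{n^{3/2}}\right).$$

Changing variables for the first term on the right hand side, and
integrating by parts for the second term, we have

$$\int_{\mathbf{R}} \varphi(x)\rho(x) \approx \int_{\mathbf{R}} \varphi(x)\frac{\sqrt{n}}{\sqrt{n-1}}\rho\left(\frac{\sqrt{n}}{\sqrt{n-1}}x\right) + \frac{1}{2n}\varphi(x)\rho''(x)\,dx + O\left(\frac{1}{n^{3/2}}\right).$$

Since $\varphi$ was an arbitrary test function, this suggests the heuristic
equation

$$\rho(x) \approx \frac{\sqrt{n}}{\sqrt{n-1}}\rho\left(\frac{\sqrt{n}}{\sqrt{n-1}}x\right) + \frac{1}{2n}\rho''(x) + O\left(\frac{1}{n^{3/2}}\right).$$

Taylor expansion gives

$$\frac{\sqrt{n}}{\sqrt{n-1}}\rho\left(\frac{\sqrt{n}}{\sqrt{n-1}}x\right) = \rho(x) + \frac{1}{2n}\rho(x) + \frac{1}{2n}x\rho'(x) + O\left(\frac{1}{n^{3/2}}\right)$$

which leads us to the heuristic ODE

$$L\rho(x) = 0$$

where $L$ is the Ornstein-Uhlenbeck operator

$$L\rho(x) := \rho(x) + x\rho'(x) + \rho''(x).$$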

Observe that $L\rho$ is the total derivative of $x\rho(x) + \rho'(x)$; integrating
from infinity, we thus get

$$x\rho(x) + \rho'(x) = 0$$

which is (2.46), and can be solved by standard ODE methods as
$\rho(x) = ce^{-x^2/2}$ for some $c$; the requirement that probability density
functions have total mass 1 then gives the constant $c$ as $\frac{1}{\sqrt{2\pi}}$, as we
knew it must.
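For completeness, here is the short verification that the Gaussian profile does solve the heuristic ODE, together with the normalisation of the constant; it uses only the definitions of $L$ and $\rho$ above and the standard Gaussian integral.

```latex
% If \rho(x) = c e^{-x^2/2}, then
\[
  \rho'(x) = -x \rho(x), \qquad
  \rho''(x) = -\rho(x) - x \rho'(x) = (x^2 - 1) \rho(x),
\]
% so that
\[
  L\rho(x) = \rho(x) + x \rho'(x) + \rho''(x)
           = \bigl( 1 - x^2 + x^2 - 1 \bigr) \rho(x) = 0 .
\]
% The total mass condition fixes c:
\[
  1 = \int_{\mathbf{R}} c e^{-x^2/2} \, dx = c \sqrt{2\pi}
  \quad\Longrightarrow\quad
  c = \frac{1}{\sqrt{2\pi}} .
\]
```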

The above argument was not rigorous, but one can make it so
with a significant amount of PDE machinery. If we view $n$ (or more
precisely, $\log n$) as a time parameter, and view $\rho$ as depending on
time, the above computations heuristically lead us eventually to the
Fokker-Planck equation

$$\partial_t\rho(t,x) = L\rho$$

for the Ornstein-Uhlenbeck process, which is a linear parabolic
equation that is fortunate enough that it can be solved exactly (indeed, it is
not difficult to transform this equation to the linear heat equation by
some straightforward changes of variable). Using the spectral theory
of the Ornstein-Uhlenbeck operator $L$, one can show that solutions
to this equation, starting from an arbitrary probability distribution,
are attracted to the Gaussian density function $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$, which as
we saw is the steady state for this equation. The stable nature of this
attraction can eventually be used to make the above heuristic analy- 
sis rigorous. However, this requires a substantial amount of technical 
effort (e.g. developing the theory of Sobolev spaces associated to L) 
and will not be attempted here. One can also proceed by relating the 
Fokker-Planck equation to the associated stochastic process, namely 
the Ornstein-Uhlenbeck process, but this requires one to first set up 
stochastic calculus, which we will not do here [12]. Stein's method,
discussed above, can also be interpreted as a way of making the above
computations rigorous (by not working with the density function $\rho$
directly, but instead testing the random variable $Z_n$ against various
test functions $\varphi$).

[12] The various Taylor expansion calculations we have performed in this section,
though, are closely related to stochastic calculus tools such as Itô's lemma.

This argument does, though, highlight two ideas which we will
see again in later sections when studying random matrices. Firstly,
that it is profitable to study the distribution of some random object
$Z_n$ by comparing it with its predecessor $Z_{n-1}$, which one presumes
to have almost the same distribution. Secondly, we see that it may 
potentially be helpful to approximate (in some weak sense) a discrete 
process (such as the iteration of the scheme (2.55)) with a continuous 
evolution (in this case, a Fokker-Planck equation) which can then be 
controlled using PDE methods. 

2.3. The operator norm of random matrices 

Now that we have developed the basic probabilistic tools that we will
need, we turn to the main subject of this text, namely the study
of random matrices. There are many random matrix models (aka
matrix ensembles) of interest, far too many to all be discussed here.
We will thus focus on just a few simple models. First of all, we shall
restrict attention to square matrices $M = (\xi_{ij})_{1 \le i,j \le n}$,
where $n$ is a (large) integer and the $\xi_{ij}$ are real or complex
random variables. (One can certainly study rectangular matrices as
well, but for simplicity we will only look at the square case.) Then,
we shall restrict to three main models:

(i) Iid matrix ensembles, in which the coefficients $\xi_{ij}$ are iid
random variables with a single distribution $\xi_{ij} \equiv \xi$. We
will often normalise $\xi$ to have mean zero and unit variance.
Examples of iid models include the Bernoulli ensemble (aka
random sign matrices) in which the $\xi_{ij}$ are signed
Bernoulli variables, the real Gaussian matrix ensemble in
which $\xi_{ij} \equiv N(0,1)_{\mathbf{R}}$, and the complex Gaussian
matrix ensemble in which $\xi_{ij} \equiv N(0,1)_{\mathbf{C}}$.

(ii) Symmetric Wigner matrix ensembles, in which the upper
triangular coefficients $\xi_{ij}$, $j \ge i$ are jointly independent
and real, but the lower triangular coefficients $\xi_{ij}$, $j < i$ are
constrained to equal their transposes: $\xi_{ij} = \xi_{ji}$. Thus $M$ by
construction is always a real symmetric matrix. Typically
the strictly upper triangular coefficients will be iid, as will
the diagonal coefficients, but the two classes of coefficients
may have a different distribution. One example here is the
symmetric Bernoulli ensemble, in which both the strictly upper
triangular and the diagonal entries are signed Bernoulli
variables; another important example is the Gaussian
Orthogonal Ensemble (GOE), in which the strictly upper triangular
entries have distribution $N(0,1)_{\mathbf{R}}$ and the diagonal entries
have distribution $N(0,2)_{\mathbf{R}}$. (We will explain the reason for
this discrepancy later.)

(iii) Hermitian Wigner matrix ensembles, in which the upper
triangular coefficients are jointly independent, with the
diagonal entries being real and the strictly upper triangular
entries complex, and the lower triangular coefficients $\xi_{ij}$,
$j < i$ are constrained to equal their adjoints:
$\xi_{ij} = \overline{\xi_{ji}}$. Thus $M$ by construction is always a
Hermitian matrix. This class of ensembles contains the symmetric
Wigner ensembles as a subclass. Another very important example is
the Gaussian Unitary Ensemble (GUE), in which all off-diagonal
entries have distribution $N(0,1)_{\mathbf{C}}$, but the diagonal
entries have distribution $N(0,1)_{\mathbf{R}}$.
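
The three models above can be sampled in a few lines. The following is only an illustrative sketch in Python (standard library only); the helper names `iid_matrix`, `wigner_matrix`, `real_gaussian` and `complex_gaussian` are hypothetical and not part of the text. It assumes the usual convention that a variable with distribution $N(0,1)_{\mathbf{C}}$ has independent real and imaginary parts, each of variance $1/2$.

```python
import random


def signed_bernoulli(rng):
    """Signed Bernoulli variable: +1 or -1, each with probability 1/2."""
    return rng.choice((-1.0, 1.0))


def real_gaussian(variance):
    """Return a sampler for the real Gaussian N(0, variance)_R."""
    return lambda rng: rng.gauss(0.0, variance ** 0.5)


def complex_gaussian(rng):
    """N(0,1)_C: real and imaginary parts are independent N(0,1/2)_R."""
    sigma = 0.5 ** 0.5
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))


def iid_matrix(n, entry, rng):
    """Model (i): all n^2 entries are iid copies of `entry`."""
    return [[entry(rng) for _ in range(n)] for _ in range(n)]


def wigner_matrix(n, offdiag, diag, rng):
    """Models (ii) and (iii): sample the upper triangle, then reflect it.

    The lower triangle is the complex conjugate of the upper triangle;
    for real entries this is just the transpose.
    """
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        M[i][i] = diag(rng)
        for j in range(i + 1, n):
            M[i][j] = offdiag(rng)
            M[j][i] = M[i][j].conjugate()
    return M


rng = random.Random(0)  # fixed seed, only for reproducibility
bernoulli = iid_matrix(4, signed_bernoulli, rng)
goe = wigner_matrix(4, real_gaussian(1.0), real_gaussian(2.0), rng)
gue = wigner_matrix(4, complex_gaussian, real_gaussian(1.0), rng)
```

Sampling only the upper triangle and reflecting it mirrors the definitions in (ii) and (iii): the symmetry is imposed by construction rather than checked afterwards.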

Given a matrix ensemble $M$, there are many statistics of $M$ that
one may wish to consider, e.g. the eigenvalues or singular values of
$M$, the trace and determinant, etc. In this section we will focus on a
basic statistic, namely the operator norm

(2.57)    $\|M\|_{op} := \sup_{x \in \mathbf{C}^n : |x| = 1} |Mx|$

of the matrix $M$. This is an interesting quantity in its own right, but
also serves as a basic upper bound on many other quantities. (For
instance, $\|M\|_{op}$ is also the largest singular value $\sigma_1(M)$
of $M$ and thus dominates the other singular values; similarly, all
eigenvalues $\lambda_i(M)$ of $M$ clearly have magnitude at most
$\|M\|_{op}$.) Because of this, it is particularly important to get good
upper tail bounds

    $\mathbf{P}(\|M\|_{op} > \lambda) \le \ldots$

on this quantity, for various thresholds $\lambda$. (Lower tail bounds
are also of interest, of course; for instance, they give us confidence
that the upper tail bounds are sharp.) Also, as we shall see, the
problem of upper bounding $\|M\|_{op}$ can be viewed as a
non-commutative analogue$^{13}$ of upper bounding the quantity $|S_n|$
studied in Section 2.1.

An $n \times n$ matrix consisting entirely of 1s has an operator norm
of exactly $n$, as can for instance be seen from the Cauchy-Schwarz
inequality. More generally, any matrix whose entries are all uniformly
$O(1)$ will have an operator norm of $O(n)$ (which can again be seen
from Cauchy-Schwarz, or alternatively from Schur's test (see e.g.
[Ta2010, §1.11]), or from a computation of the Frobenius norm (see
(2.63))). However, this argument does not take advantage of possible
cancellations in $M$. Indeed, from analogy with concentration of
measure, when the entries of the matrix $M$ are independent, bounded
and have mean zero, we expect the operator norm to be of size
$O(\sqrt{n})$ rather than $O(n)$. We shall see shortly that this
intuition is indeed correct$^{14}$.
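
The contrast between $n$ and $O(\sqrt{n})$ is easy to observe numerically. The sketch below is illustrative only: it estimates the operator norm of a real matrix by power iteration on $M^T M$, a standard numerical technique that is not the method of this text, and the function name `operator_norm`, the size $n = 60$ and the seeds are arbitrary assumptions.

```python
import math
import random


def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def operator_norm(M, iters=200, seed=0):
    """Estimate ||M||_op for a real matrix by power iteration on M^T M.

    Requires iters >= 1. The returned value is |Mx| for a unit vector x,
    so it never exceeds the true operator norm.
    """
    rng = random.Random(seed)
    Mt = [list(col) for col in zip(*M)]
    x = [rng.gauss(0.0, 1.0) for _ in range(len(M[0]))]
    for _ in range(iters):
        y = matvec(Mt, matvec(M, x))
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            return 0.0
        x = [v / norm for v in y]
    return math.sqrt(sum(v * v for v in matvec(M, x)))


n = 60
ones = [[1.0] * n for _ in range(n)]
rng = random.Random(1)
signs = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(n)]

norm_ones = operator_norm(ones)    # equals n for the all-ones matrix
norm_signs = operator_norm(signs)  # comparable to sqrt(n) for random signs
print(norm_ones / n, norm_signs / math.sqrt(n))
```

The first printed ratio should be essentially 1, while the second stays bounded as $n$ grows, in line with the heuristic above.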

As mentioned before, there is an analogy here with the concentration
of measure$^{15}$ phenomenon, and many of the tools used in the
latter (e.g. the moment method) will also appear here. Similarly, just 
as many of the tools from concentration of measure could be adapted 
to help prove the central limit theorem, several of the tools seen here 
will be of use in deriving the semicircular law in Section 2.4. 

The most advanced knowledge we have on the operator norm is
given by the Tracy-Widom law, which not only tells us where the
operator norm is concentrated (it turns out, for instance, that for
a Wigner matrix (with some additional technical assumptions), it is
concentrated in the range
$[2\sqrt{n} - O(n^{-1/6}), 2\sqrt{n} + O(n^{-1/6})]$), but
what its distribution in that range is. While the methods in this
section can eventually be pushed to establish this result, this is far
from trivial, and will only be briefly discussed here. We will however
discuss the Tracy-Widom law at several later points in the text.

$^{13}$ The analogue of the central limit theorem studied in Section 2.2
is the Wigner semicircular law, which will be studied in Section 2.4.

$^{14}$ One can see, though, that the mean zero hypothesis is important;
from the triangle inequality we see that if we add the all-ones matrix
(for instance) to a random matrix with mean zero, to obtain a random
matrix whose coefficients all have mean 1, then at least one of the two
random matrices necessarily has operator norm at least $n/2$.

$^{15}$ Indeed, we will be able to use some of the concentration
inequalities from Section 2.1 directly to help control $\|M\|_{op}$ and
related quantities.

2.3.1. The epsilon net argument. The slickest way to control
$\|M\|_{op}$ is via the moment method. But let us defer using this method
for the moment, and work with a more "naive" way to control the
operator norm, namely by working with the definition (2.57). From
that definition, we see that we can view the upper tail event
$\|M\|_{op} > \lambda$ as a union of many simpler events:

(2.58)    $\mathbf{P}(\|M\|_{op} > \lambda) \le \mathbf{P}(\bigvee_{x \in S} |Mx| > \lambda)$

where $S := \{x \in \mathbf{C}^n : |x| = 1\}$ is the unit sphere in the
complex space $\mathbf{C}^n$.

The point of doing this is that the event $|Mx| > \lambda$ is easier to
control than the event $\|M\|_{op} > \lambda$, and can in fact be handled
by the concentration of measure estimates we already have. For instance:

Lemma 2.3.1. Suppose that the coefficients $\xi_{ij}$ of $M$ are
independent, have mean zero, and are uniformly bounded in magnitude by
1. Let $x$ be a unit vector in $\mathbf{C}^n$. Then for sufficiently
large $A$ (larger than some absolute constant), one has

    $\mathbf{P}(|Mx| > A\sqrt{n}) \le C \exp(-cAn)$

for some absolute constants $C, c > 0$.

Proof. Let $X_1, \ldots, X_n$ be the $n$ rows of $M$; then the column
vector $Mx$ has coefficients $X_i \cdot x$ for $i = 1, \ldots, n$. If we
let $x_1, \ldots, x_n$ be the coefficients of $x$, so that
$\sum_{j=1}^n |x_j|^2 = 1$, then $X_i \cdot x$ is just
$\sum_{j=1}^n \xi_{ij} x_j$. Applying standard concentration of measure
results (e.g. Exercise 2.1.4, Exercise 2.1.5, or Theorem 2.1.13), we see
that each $X_i \cdot x$ is uniformly subgaussian, thus

    $\mathbf{P}(|X_i \cdot x| > \lambda) \le C \exp(-c\lambda^2)$

for some absolute constants $C, c > 0$. In particular, we have

    $\mathbf{E} e^{c|X_i \cdot x|^2} \le C$

for some (slightly different) absolute constants $C, c > 0$. Multiplying
these inequalities together for all $i$ (using the independence of the
rows), we obtain

    $\mathbf{E} e^{c|Mx|^2} \le C^n$

and the claim then follows from Markov's inequality (1.14). □

Thus (with the hypotheses of Lemma 2.3.1), we see that for
each individual unit vector $x$, we have $|Mx| = O(\sqrt{n})$ with
overwhelming probability. It is then tempting to apply the union bound
and try to conclude that $\|M\|_{op} = O(\sqrt{n})$ with overwhelming
probability also. However, we encounter a difficulty: the unit sphere $S$
is uncountable, and so we are taking the union over an uncountable
number of events. Even though each event occurs with exponentially
small probability, the union could well be everything.

Of course, it is extremely wasteful to apply the union bound to
an uncountable union. One can pass to a countable union just by
working with a countable dense subset of the unit sphere $S$ instead of
the sphere itself, since the map $x \mapsto |Mx|$ is continuous. Of
course, this is still an infinite set and so we still cannot usefully
apply the union bound. However, the map $x \mapsto |Mx|$ is not just
continuous; it is Lipschitz continuous, with a Lipschitz constant of
$\|M\|_{op}$. Now, of course there is some circularity here because
$\|M\|_{op}$ is precisely the quantity we are trying to bound.
Nevertheless, we can use this stronger continuity to refine the
countable dense subset further, to a finite (but still quite dense)
subset of $S$, at the slight cost of modifying the threshold $\lambda$
by a constant factor. Namely:

Lemma 2.3.2. Let $\Sigma$ be a maximal $1/2$-net of the sphere $S$, i.e.
a set of points in $S$ that are separated from each other by a distance
of at least $1/2$, and which is maximal with respect to set inclusion.
Then for any $n \times n$ matrix $M$ with complex coefficients, and any
$\lambda > 0$, we have

    $\mathbf{P}(\|M\|_{op} > \lambda) \le \mathbf{P}(\bigvee_{y \in \Sigma} |My| > \lambda/2)$.

Proof. By (2.57) (and compactness) we can find $x \in S$ such that

    $|Mx| = \|M\|_{op}$.

This point $x$ need not lie in $\Sigma$. However, as $\Sigma$ is a
maximal $1/2$-net of $S$, we know that $x$ lies within $1/2$ of some
point $y$ in $\Sigma$ (since otherwise we could add $x$ to $\Sigma$ and
contradict maximality). Since $|x - y| \le 1/2$, we have

    $|M(x-y)| \le \|M\|_{op}/2$.

By the triangle inequality we conclude that

    $|My| \ge \|M\|_{op}/2$.

In particular, if $\|M\|_{op} > \lambda$, then $|My| > \lambda/2$ for
some $y \in \Sigma$, and the claim follows. □
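
The deterministic inequality $\|M\|_{op} \le 2 \max_{y \in \Sigma} |My|$ behind Lemma 2.3.2 can be checked by hand in a toy case. The sketch below is an illustration under simplifying assumptions that are not in the text: it works on the real unit circle in $\mathbf{R}^2$ rather than the complex sphere (the proof is the same), uses an explicit net of twelve equally spaced points, and the names `NET`, `op_norm_2x2` and `max_over_net` are hypothetical.

```python
import math
import random


def op_norm_2x2(M):
    """Exact operator norm (largest singular value) of a real 2x2 matrix."""
    (a, b), (c, d) = M
    s = a * a + b * b + c * c + d * d   # squared Frobenius norm
    det = a * d - b * c
    disc = max(s * s - 4.0 * det * det, 0.0)
    return math.sqrt((s + math.sqrt(disc)) / 2.0)


# Twelve equally spaced points on the unit circle. Adjacent points are at
# distance 2*sin(pi/12) ~ 0.518 >= 1/2, and every point of the circle lies
# within 2*sin(pi/24) ~ 0.261 < 1/2 of one of them, so no further point can
# be added: this is a maximal 1/2-net of the circle.
NET = [(math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12))
       for k in range(12)]


def max_over_net(M):
    """Maximum of |My| over the net points y."""
    return max(math.hypot(M[0][0] * y0 + M[0][1] * y1,
                          M[1][0] * y0 + M[1][1] * y1)
               for y0, y1 in NET)


rng = random.Random(0)
samples = [[[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2)]
           for _ in range(1000)]
ratios = [op_norm_2x2(M) / max_over_net(M) for M in samples]
print(min(ratios), max(ratios))  # every ratio lies between 1 and 2
```

In this toy case the ratio is in fact far below 2, because the net is much finer than the lemma requires; the factor of 2 is what survives in high dimension.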

Remark 2.3.3. Clearly, if one replaces the maximal $1/2$-net here
with a maximal $\varepsilon$-net for some other $0 < \varepsilon < 1$
(defined in the obvious manner), then we get the same conclusion, but
with $\lambda/2$ replaced by $\lambda(1 - \varepsilon)$.

Now that we have discretised the range of points y to be finite, 
the union bound becomes viable again. We first make the following 
basic observation: 

Lemma 2.3.4 (Volume packing argument). Let $0 < \varepsilon < 1$, and
let $\Sigma$ be an $\varepsilon$-net of the sphere $S$. Then $\Sigma$ has
cardinality at most $(C/\varepsilon)^n$ for some absolute constant
$C > 0$.

Proof. Consider the balls of radius $\varepsilon/2$ centred around each
point in $\Sigma$; by hypothesis, these are disjoint. On the other hand,
by the triangle inequality, they are all contained in the ball of radius
$3/2$ centred at the origin. The volume of the latter ball is at most
$(C/\varepsilon)^n$ times the volume of any of the small balls, and the
claim follows. □

Exercise 2.3.1. Conversely, if $\Sigma$ is a maximal $\varepsilon$-net,
show that $\Sigma$ has cardinality at least $(c/\varepsilon)^n$ for some
absolute constant $c > 0$.
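
To see the volume packing bound in action, one can build an $\varepsilon$-separated set greedily and compare its size with the bound from the proof. The following sketch makes several assumptions not in the text: it works on the real unit sphere in $\mathbf{R}^d$ (so the exponent is the real dimension $d$), it uses the explicit ratio $((1+\varepsilon/2)/(\varepsilon/2))^d = (1 + 2/\varepsilon)^d$ in place of an unspecified constant $C$, and a greedy pass over random samples only approximates a maximal net. The function names are hypothetical.

```python
import math
import random


def random_unit_vector(d, rng):
    """A uniformly random point on the unit sphere of R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    r = math.sqrt(sum(t * t for t in v))
    return [t / r for t in v]


def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def greedy_separated_set(d, eps, samples, rng):
    """Keep each random sample that is at distance >= eps from all kept points."""
    net = []
    for _ in range(samples):
        x = random_unit_vector(d, rng)
        if all(distance(x, y) >= eps for y in net):
            net.append(x)
    return net


def packing_bound(d, eps):
    """Volume ratio of the ball of radius 1 + eps/2 to a ball of radius eps/2."""
    return (1.0 + 2.0 / eps) ** d


d, eps = 3, 0.5
net = greedy_separated_set(d, eps, samples=2000, rng=random.Random(0))
print(len(net), packing_bound(d, eps))
```

The greedy set is always $\varepsilon$-separated, so the packing bound applies to it regardless of whether the random samples happen to exhaust the sphere.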

And now we get an upper tail estimate: 

Corollary 2.3.5 (Upper tail estimate for iid ensembles). Suppose
that the coefficients $\xi_{ij}$ of $M$ are independent, have mean zero,
and are uniformly bounded in magnitude by 1. Then there exist absolute
constants $C, c > 0$ such that

    $\mathbf{P}(\|M\|_{op} > A\sqrt{n}) \le C \exp(-cAn)$

for all $A \ge C$. In particular, we have $\|M\|_{op} = O(\sqrt{n})$ with
overwhelming probability.

Proof. From Lemma 2.3.2 and the union bound, we have

    $\mathbf{P}(\|M\|_{op} > A\sqrt{n}) \le \sum_{y \in \Sigma} \mathbf{P}(|My| > A\sqrt{n}/2)$

where $\Sigma$ is a maximal $1/2$-net of $S$. By Lemma 2.3.1, each of the
probabilities $\mathbf{P}(|My| > A\sqrt{n}/2)$ is bounded by
$C \exp(-cAn)$ if $A$ is large enough. Meanwhile, from Lemma 2.3.4,
$\Sigma$ has cardinality $O(1)^n$. If $A$ is large enough, the entropy
loss$^{16}$ of $O(1)^n$ can be absorbed into the exponential gain of
$\exp(-cAn)$ by modifying $c$ slightly, and the claim follows. □

Exercise 2.3.2. If $\Sigma$ is a maximal $1/4$-net instead of a maximal
$1/2$-net, establish the following variant

    $\mathbf{P}(\|M\|_{op} > \lambda) \le \mathbf{P}(\bigvee_{x,y \in \Sigma} |x^* M y| > \lambda/4)$

of Lemma 2.3.2. Use this to provide an alternate proof of Corollary
2.3.5.

The above result was for matrices with independent entries, but 
it easily extends to the Wigner case: 

Corollary 2.3.6 (Upper tail estimate for Wigner ensembles). Suppose
that the coefficients $\xi_{ij}$ of $M$ are independent for $j \ge i$,
mean zero, and uniformly bounded in magnitude by 1, and let
$\xi_{ij} := \xi_{ji}$ for $j < i$. Then there exist absolute constants
$C, c > 0$ such that

    $\mathbf{P}(\|M\|_{op} > A\sqrt{n}) \le C \exp(-cAn)$

for all $A \ge C$. In particular, we have $\|M\|_{op} = O(\sqrt{n})$ with
overwhelming probability.

Proof. From Corollary 2.3.5, the claim already holds for the
upper-triangular portion of $M$, as well as for the strict
lower-triangular portion of $M$. The claim then follows from the
triangle inequality (adjusting the constants $C, c$ appropriately). □

$^{16}$ Roughly speaking, the entropy of a configuration is the logarithm
of the number of possible states that configuration can be in. When
applying the union bound to control all possible configurations at once,
one often loses a factor proportional to the number of such states; this
factor is sometimes referred to as the entropy factor or entropy loss in
one's argument.

Exercise 2.3.3. Generalise Corollary 2.3.5 and Corollary 2.3.6 to the
case where the coefficients $\xi_{ij}$ have uniform subgaussian tails,
rather than being uniformly bounded in magnitude by 1.

Remark 2.3.7. What we have just seen is a simple example of an
epsilon net argument, which is useful when controlling a supremum
of random variables $\sup_{x \in S} X_x$ such as (2.57), where each
individual random variable $X_x$ is known to obey a large deviation
inequality (in this case, Lemma 2.3.1). The idea is to use metric
arguments (e.g. the triangle inequality, see Lemma 2.3.2) to refine the
set of parameters $S$ to take the supremum over to an $\varepsilon$-net
$\Sigma = \Sigma_\varepsilon$ for some suitable $\varepsilon$, and then
apply the union bound. One takes a loss based on the cardinality of the
$\varepsilon$-net (which is basically the covering number of the
original parameter space at scale $\varepsilon$), but one can hope that
the bounds from the large deviation inequality are strong enough (and
the metric entropy bounds sufficiently accurate) to overcome this
entropy loss.

There is of course the question of what scale $\varepsilon$ to use. In
this simple example, the scale $\varepsilon = 1/2$ sufficed. In other
contexts, one has to choose the scale $\varepsilon$ more carefully. In
more complicated examples with no natural preferred scale, it often
makes sense to take a large range of scales (e.g.
$\varepsilon = 2^{-j}$ for $j = 1, \ldots, J$) and chain them together
by using telescoping series such as
$X_x = X_{x_1} + \sum_{j=1}^J (X_{x_{j+1}} - X_{x_j})$ (where $x_j$ is
the nearest point in $\Sigma_j$ to $x$ for $j = 1, \ldots, J$, and
$x_{J+1}$ is $x$ by convention) to estimate the supremum, the point
being that one can hope to exploit cancellations between adjacent
elements of the sequence $X_{x_j}$. This is known as the method of
chaining. There is an even more powerful refinement of this method,
known as the method of generic chaining, which has a large number of
applications; see [Ta2005] for a beautiful and systematic treatment of
the subject. However, we will not use this method in this text.

2.3.2. A symmetrisation argument (optional). We pause here
to record an elegant symmetrisation argument that exploits convexity
to allow us to reduce without loss of generality to the symmetric case
$M \equiv -M$, albeit at the cost of losing a factor of 2. We will not
use this type of argument directly in this text, but it is often used
elsewhere in the literature.

Let $M$ be any random matrix with mean zero, and let $\tilde M$ be an
independent copy of $M$. Then, conditioning on $M$, we have

    $\mathbf{E}(M - \tilde M | M) = M$.

As the operator norm $M \mapsto \|M\|_{op}$ is convex, we can then apply
Jensen's inequality (Exercise 1.1.8) to conclude that

    $\mathbf{E}(\|M - \tilde M\|_{op} | M) \ge \|M\|_{op}$.

Undoing the conditioning over $M$, we conclude that

(2.59)    $\mathbf{E}\|M - \tilde M\|_{op} \ge \mathbf{E}\|M\|_{op}$.

Thus, to upper bound the expected operator norm of $M$, it suffices
to upper bound the expected operator norm of $M - \tilde M$. The point is
that even if $M$ is not symmetric ($M \not\equiv -M$), $M - \tilde M$ is
automatically symmetric.

One can modify (2.59) in a few ways, given some more hypotheses
on $M$. Suppose now that $M = (\xi_{ij})_{1 \le i,j \le n}$ is a matrix
with independent entries, thus $M - \tilde M$ has coefficients
$\xi_{ij} - \tilde\xi_{ij}$ where $\tilde\xi_{ij}$ is an independent copy
of $\xi_{ij}$. Introduce a random sign matrix
$E = (\varepsilon_{ij})_{1 \le i,j \le n}$ which is (jointly) independent
of $M, \tilde M$. Observe that, as the distribution of
$\xi_{ij} - \tilde\xi_{ij}$ is symmetric, we have

    $(\xi_{ij} - \tilde\xi_{ij}) \equiv (\xi_{ij} - \tilde\xi_{ij}) \varepsilon_{ij}$,

and thus

    $(M - \tilde M) \equiv (M - \tilde M) \cdot E$

where $A \cdot B := (a_{ij} b_{ij})_{1 \le i,j \le n}$ is the Hadamard
product of $A = (a_{ij})_{1 \le i,j \le n}$ and
$B = (b_{ij})_{1 \le i,j \le n}$. We conclude from (2.59) that

    $\mathbf{E}\|M\|_{op} \le \mathbf{E}\|(M - \tilde M) \cdot E\|_{op}$.

By the distributive law and the triangle inequality we have

    $\|(M - \tilde M) \cdot E\|_{op} \le \|M \cdot E\|_{op} + \|\tilde M \cdot E\|_{op}$.

But as $M \cdot E \equiv \tilde M \cdot E$, the quantities
$\|M \cdot E\|_{op}$ and $\|\tilde M \cdot E\|_{op}$ have the same
expectation. We conclude the symmetrisation inequality

(2.60)    $\mathbf{E}\|M\|_{op} \le 2\mathbf{E}\|M \cdot E\|_{op}$.

Thus, if one does not mind losing a factor of two, one has the
freedom to randomise the sign of each entry of $M$ independently
(assuming that the entries were already independent). Thus, in proving
Corollary 2.3.5, one could have reduced to the case when the $\xi_{ij}$
were symmetric, though in this case this would not have made the
argument that much simpler.
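
The symmetrisation inequality (2.60) can be verified exactly in a tiny case by enumerating every outcome. The sketch below is illustrative only and rests on assumptions not in the text: $M$ is a real $2 \times 2$ matrix whose entries are iid with the hypothetical two-point law $\xi = 2$ with probability $1/3$ and $\xi = -1$ with probability $2/3$ (mean zero but not symmetric), and the operator norm is computed from the closed formula for the largest singular value of a $2 \times 2$ matrix.

```python
import itertools
import math

# Hypothetical entry law: value 2 with probability 1/3, value -1 with
# probability 2/3. It has mean zero but is not symmetric.
LAW = ((2.0, 1.0 / 3.0), (-1.0, 2.0 / 3.0))


def op_norm_2x2(a, b, c, d):
    """Largest singular value of the real matrix [[a, b], [c, d]]."""
    s = a * a + b * b + c * c + d * d
    det = a * d - b * c
    disc = max(s * s - 4.0 * det * det, 0.0)
    return math.sqrt((s + math.sqrt(disc)) / 2.0)


def expected_norm(random_signs):
    """Exact E||M||_op, or E||M . E||_op when random_signs is True."""
    total = 0.0
    for outcome in itertools.product(LAW, repeat=4):
        values = [v for v, _ in outcome]
        prob = 1.0
        for _, p in outcome:
            prob *= p
        if random_signs:
            # Average over all 16 equally likely sign patterns E.
            norms = [op_norm_2x2(*(s * v for s, v in zip(signs, values)))
                     for signs in itertools.product((-1.0, 1.0), repeat=4)]
            total += prob * sum(norms) / len(norms)
        else:
            total += prob * op_norm_2x2(*values)
    return total


lhs = expected_norm(random_signs=False)        # E ||M||_op
rhs = 2.0 * expected_norm(random_signs=True)   # 2 E ||M . E||_op
print(lhs, rhs)
```

Because both expectations are finite sums, the comparison involves no sampling error; the gap between the two printed numbers indicates how much the factor of 2 loses in this particular example.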

Sometimes it is preferable to multiply the coefficients by a Gaussian
rather than by a random sign. Again, let
$M = (\xi_{ij})_{1 \le i,j \le n}$ have independent entries with mean
zero. Let $G = (g_{ij})_{1 \le i,j \le n}$ be a real Gaussian matrix
independent of $M$, thus the $g_{ij} \equiv N(0,1)_{\mathbf{R}}$ are
iid. We can split $G = E \cdot |G|$, where
$E := (\mathrm{sgn}(g_{ij}))_{1 \le i,j \le n}$ and
$|G| := (|g_{ij}|)_{1 \le i,j \le n}$. Note that $E$, $M$, $|G|$ are
independent, and $E$ is a random sign matrix. In particular, (2.60)
holds. We now use

Exercise 2.3.4. If $g \equiv N(0,1)_{\mathbf{R}}$, show that
$\mathbf{E}|g| = \sqrt{2/\pi}$.

From this exercise we see that

    $\mathbf{E}(M \cdot E \cdot |G| \,|\, M, E) = \sqrt{2/\pi}\, M \cdot E$

and hence by Jensen's inequality (Exercise 1.1.8) again

    $\mathbf{E}(\|M \cdot E \cdot |G|\|_{op} \,|\, M, E) \ge \sqrt{2/\pi}\, \|M \cdot E\|_{op}$.

Undoing the conditional expectation in $M, E$ and applying (2.60) we
conclude the gaussian symmetrisation inequality

(2.61)    $\mathbf{E}\|M\|_{op} \le \sqrt{2\pi}\, \mathbf{E}\|M \cdot G\|_{op}$.



Thus, for instance, when proving Corollary 2.3.5, one could have 
inserted a random Gaussian in front of each coefficient. This would 
have made the proof of Lemma 2.3.1 marginally simpler (as one could 
compute directly with Gaussians, and reduce the number of appeals 
to concentration of measure results) but in this case the improvement 
is negligible. In other situations though it can be quite helpful to 
have the additional random sign or random Gaussian factor present. 
For instance, we have the following result of Latala [La2005]:

Theorem 2.3.8. Let $M = (\xi_{ij})_{1 \le i,j \le n}$ be a matrix with
independent mean zero entries, obeying the second moment bounds

    $\sup_i \sum_{j=1}^n \mathbf{E}|\xi_{ij}|^2 \le K^2 n$

    $\sup_j \sum_{i=1}^n \mathbf{E}|\xi_{ij}|^2 \le K^2 n$

and the fourth moment bound

    $\sum_{i=1}^n \sum_{j=1}^n \mathbf{E}|\xi_{ij}|^4 \le K^4 n^2$

for some $K > 0$. Then $\mathbf{E}\|M\|_{op} = O(K\sqrt{n})$.

Proof. (Sketch only) Using (2.61) one can replace $\xi_{ij}$ by
$\xi_{ij} \cdot g_{ij}$ without much penalty. One then runs the
epsilon-net argument with an explicit net, and uses concentration of
measure results for Gaussians (such as Theorem 2.1.12) to obtain the
analogue of Lemma 2.3.1. The details are rather intricate, and we refer
the interested reader to [La2005]. □

As a corollary of Theorem 2.3.8, we see that if we have an iid
matrix (or Wigner matrix) of mean zero whose entries have a fourth
moment of $O(1)$, then the expected operator norm is $O(\sqrt{n})$. The
fourth moment hypothesis is sharp. To see this, we make the trivial
observation that the operator norm of a matrix
$M = (\xi_{ij})_{1 \le i,j \le n}$ bounds the magnitude of any of its
coefficients, thus

    $\sup_{1 \le i,j \le n} |\xi_{ij}| \le \|M\|_{op}$

or equivalently that

    $\mathbf{P}(\|M\|_{op} \le \lambda) \le \mathbf{P}(\bigwedge_{1 \le i,j \le n} |\xi_{ij}| \le \lambda)$.

In the iid case $\xi_{ij} \equiv \xi$, and setting
$\lambda = A\sqrt{n}$ for some fixed $A$ independent of $n$, we thus
have

(2.62)    $\mathbf{P}(\|M\|_{op} \le A\sqrt{n}) \le \mathbf{P}(|\xi| \le A\sqrt{n})^{n^2}$.

With the fourth moment hypothesis, one has from dominated convergence
that

    $\mathbf{P}(|\xi| \le A\sqrt{n}) \ge 1 - o_A(1/n^2)$,

and so the right-hand side of (2.62) is asymptotically trivial. But
with weaker hypotheses than the fourth moment hypothesis, the rate
of convergence of $\mathbf{P}(|\xi| \le A\sqrt{n})$ to 1 can be slower,
and one can easily build examples for which the right-hand side of
(2.62) is $o_A(1)$ for every $A$, which forces $\|M\|_{op}$ to typically
be much larger than $\sqrt{n}$ on the average.
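
A concrete family makes the role of the fourth moment visible. The sketch below is only an illustration under an assumption that is not in the text: $\xi$ is taken to be symmetric with the Pareto-type tail $\mathbf{P}(|\xi| > t) = t^{-\alpha}$ for $t \ge 1$, which has finite variance when $\alpha > 2$ and finite fourth moment only when $\alpha > 4$ (the variance is not normalised to 1 here). The function name `entrywise_bound` is hypothetical; it evaluates the right-hand side of (2.62).

```python
import math


def entrywise_bound(n, A, alpha):
    """Right-hand side of (2.62) for the tail P(|xi| > t) = t^(-alpha), t >= 1.

    Returns P(|xi| <= A sqrt(n)) ** (n ** 2), computed via logarithms to
    avoid rounding 1 - tiny to exactly 1.
    """
    t = A * math.sqrt(n)
    if t <= 1.0:
        return 0.0  # below the support: P(|xi| <= t) = 0 under this law
    tail = t ** (-alpha)
    return math.exp(n * n * math.log1p(-tail))


for alpha in (3.0, 5.0):
    values = [entrywise_bound(n, 1.0, alpha) for n in (10**2, 10**4, 10**6)]
    print(alpha, values)
```

For $\alpha = 3$ (infinite fourth moment) the bound tends to 0, so $\|M\|_{op} \le A\sqrt{n}$ becomes unlikely, while for $\alpha = 5$ the bound tends to 1 and gives no information.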

Remark 2.3.9. The symmetrisation inequalities remain valid with 
the operator norm replaced by any other convex norm on the space 
of matrices. The results are also just as valid for rectangular matrices 
as for square ones. 

2.3.3. Concentration of measure. Consider a random matrix $M$
of the type considered in Corollary 2.3.5 (e.g. a random sign matrix).
We now know that the operator norm $\|M\|_{op}$ is of size
$O(\sqrt{n})$ with overwhelming probability. But there is much more that
can be said. For instance, by taking advantage of the convexity and
Lipschitz properties of $\|M\|_{op}$, we have the following quick
application of Talagrand's inequality (Theorem 2.1.13):

Proposition 2.3.10. Let $M$ be as in Corollary 2.3.5. Then for any
$\lambda > 0$, one has

    $\mathbf{P}(|\|M\|_{op} - \mathbf{M}\|M\|_{op}| \ge \lambda) \le C \exp(-c\lambda^2)$

for some absolute constants $C, c > 0$, where $\mathbf{M}\|M\|_{op}$ is a
median value for $\|M\|_{op}$. The same result also holds with
$\mathbf{M}\|M\|_{op}$ replaced by the expectation
$\mathbf{E}\|M\|_{op}$.

Proof. We view $\|M\|_{op}$ as a function
$F((\xi_{ij})_{1 \le i,j \le n})$ of the independent complex variables
$\xi_{ij}$, thus $F$ is a function from $\mathbf{C}^{n^2}$ to
$\mathbf{R}$. The convexity of the operator norm tells us that $F$ is
convex. The triangle inequality, together with the elementary bound

(2.63)    $\|M\|_{op} \le \|M\|_F$

(easily proven by Cauchy-Schwarz), where

(2.64)    $\|M\|_F := (\sum_{i=1}^n \sum_{j=1}^n |\xi_{ij}|^2)^{1/2}$

is the Frobenius norm (also known as the Hilbert-Schmidt norm or
2-Schatten norm), tells us that $F$ is Lipschitz with constant 1. The
claim then follows directly from Talagrand's inequality (Theorem
2.1.13). □

Exercise 2.3.5. Establish a similar result for the matrices in
Corollary 2.3.6.

From Corollary 2.3.5 we know that the median or expectation
of $\|M\|_{op}$ is of size $O(\sqrt{n})$; we now know that $\|M\|_{op}$
concentrates around this median to width at most $O(1)$. (This turns out
to be non-optimal; the Tracy-Widom law actually gives a concentration of
$O(n^{-1/6})$, under some additional assumptions on $M$. Nevertheless
this level of concentration is already non-trivial.)

However, this argument does not tell us much about what the
median or expected value of $\|M\|_{op}$ actually is. For this, we will
need to use other methods, such as the moment method which we turn to
next.

Remark 2.3.11. Talagrand's inequality, as formulated in Theorem
2.1.13, relies heavily on convexity. Because of this, we cannot apply
this argument directly to non-convex matrix statistics, such as
singular values $\sigma_j(M)$ other than the largest singular value
$\sigma_1(M)$. Nevertheless, one can still use this inequality to obtain
good concentration results, by using the convexity of related
quantities, such as the partial sums $\sum_{j=1}^J \sigma_j(M)$; see
[Me2004]. Other approaches include the use of alternate large deviation
inequalities, such as those arising from log-Sobolev inequalities (see
e.g. [Gu2009]), or by using more abstract versions of Talagrand's
inequality (see [AlKrVu2002]).

2.3.4. The moment method. We now bring the moment method
to bear on the problem, starting with the easy moments and working
one's way up to the more sophisticated moments. It turns out that
it is easier to work first with the case when $M$ is symmetric or
Hermitian; we will discuss the non-symmetric case near the end of this
section.

The starting point for the moment method is the observation that
for symmetric or Hermitian $M$, the operator norm $\|M\|_{op}$ is equal
to the $\ell^\infty$ norm

(2.65)    $\|M\|_{op} = \max_{1 \le i \le n} |\lambda_i|$

of the eigenvalues $\lambda_1, \ldots, \lambda_n \in \mathbf{R}$ of $M$.
On the other hand, we have the standard linear algebra identity

    $\mathrm{tr}(M) = \sum_{i=1}^n \lambda_i$

and more generally

    $\mathrm{tr}(M^k) = \sum_{i=1}^n \lambda_i^k$.

In particular, if $k = 2, 4, \ldots$ is an even integer, then
$\mathrm{tr}(M^k)^{1/k}$ is just the $\ell^k$ norm of these eigenvalues,
and we have the inequalities

(2.66)    $\|M\|_{op}^k \le \mathrm{tr}(M^k) \le n \|M\|_{op}^k$.

To put this another way, knowledge of the $k$-th moment
$\mathrm{tr}(M^k)$ controls the operator norm up to a multiplicative
factor of $n^{1/k}$. Taking larger and larger $k$, we should thus obtain
more accurate control on the operator norm$^{17}$.
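
The sandwich (2.66) can be watched tightening as $k$ grows. The sketch below is illustrative: it draws one symmetric sign matrix (an assumed example, with hypothetical helper names and an arbitrary size $n = 8$), computes $\mathrm{tr}(M^k)$ exactly in integer arithmetic for even $k$, and records the resulting bounds $(\mathrm{tr}(M^k)/n)^{1/k} \le \|M\|_{op} \le \mathrm{tr}(M^k)^{1/k}$.

```python
import random


def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


def symmetric_sign_matrix(n, rng):
    """Symmetric Bernoulli ensemble: independent signs on and above the diagonal."""
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            M[i][j] = M[j][i] = rng.choice((-1, 1))
    return M


def moment_bounds(M, kmax):
    """Map each even k <= kmax to the (lower, upper) bounds on ||M||_op from (2.66)."""
    n = len(M)
    M2 = matmul(M, M)
    power = M2                       # holds M^k for the current even k
    bounds = {}
    for k in range(2, kmax + 1, 2):
        t = trace(power)             # exact integer tr(M^k)
        bounds[k] = ((t / n) ** (1.0 / k), t ** (1.0 / k))
        power = matmul(power, M2)
    return bounds


n = 8
M = symmetric_sign_matrix(n, random.Random(0))
bounds = moment_bounds(M, 16)
for k, (lower, upper) in bounds.items():
    print(k, lower, upper)
```

For $k = 2$ the bounds are exactly $\sqrt{n}$ and $n$; the two columns then move towards each other, and their ratio is always $n^{1/k}$.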

Remark 2.3.12. In most cases, one expects the eigenvalues to be 
reasonably uniformly distributed, in which case the upper bound in 
(2.66) is closer to the truth than the lower bound. One scenario in 
which this can be rigorously established is if it is known that the 
eigenvalues of M all come with a high multiplicity. This is often the 
case for matrices associated with group actions (particularly those 
which are quasirandom in the sense of Gowers [Go2008]). However,
this is usually not the case with most random matrix ensembles, and 
we must instead proceed by increasing k as described above. 



$^{17}$ This is also the philosophy underlying the power method in
numerical linear algebra.

Let's see how this method works in practice. The simplest case
is that of the second moment $\mathrm{tr}(M^2)$, which in the Hermitian
case works out to

    $\mathrm{tr}(M^2) = \sum_{i=1}^n \sum_{j=1}^n |\xi_{ij}|^2 = \|M\|_F^2$.

Note that (2.63) is just the $k = 2$ case of the lower inequality in
(2.66), at least in the Hermitian case.

The expression $\sum_{i=1}^n \sum_{j=1}^n |\xi_{ij}|^2$ is easy to
compute in practice. For instance, for the symmetric Bernoulli ensemble,
this expression is exactly equal to $n^2$. More generally, if we have a
Wigner matrix in which all off-diagonal entries have mean zero and unit
variance, and the diagonal entries have mean zero and bounded variance
(this is the case for instance for GOE), then the off-diagonal entries
$|\xi_{ij}|^2$ have mean 1, and by the law of large numbers$^{18}$ we
see that this expression is almost surely asymptotic to $n^2$.

From the weak law of large numbers, we see in particular that
one has

(2.67)    $\sum_{i=1}^n \sum_{j=1}^n |\xi_{ij}|^2 = (1 + o(1)) n^2$

asymptotically almost surely.

Exercise 2.3.6. If the $\xi_{ij}$ have uniformly sub-exponential tail,
show that we in fact have (2.67) with overwhelming probability.

Applying (2.66), we obtain the bounds

(2.68)    $(1 + o(1)) \sqrt{n} \le \|M\|_{op} \le (1 + o(1)) n$

asymptotically almost surely. This is already enough to show that the
median of $\|M\|_{op}$ is at least $(1 + o(1))\sqrt{n}$, which
complements (up to constants) the upper bound of $O(\sqrt{n})$ obtained
from the epsilon net argument. But the upper bound here is terrible; we
will need to move to higher moments to improve it.

Accordingly, we now turn to the fourth moment. For simplicity
let us assume that all entries $\xi_{ij}$ have zero mean and unit
variance. To control moments beyond the second moment, we will also
assume that all entries are bounded in magnitude by some $K$. We expand

    $\mathrm{tr}(M^4) = \sum_{1 \le i_1, i_2, i_3, i_4 \le n} \xi_{i_1 i_2} \xi_{i_2 i_3} \xi_{i_3 i_4} \xi_{i_4 i_1}$.

To understand this expression, we take expectations:

    $\mathbf{E}\,\mathrm{tr}(M^4) = \sum_{1 \le i_1, i_2, i_3, i_4 \le n} \mathbf{E} \xi_{i_1 i_2} \xi_{i_2 i_3} \xi_{i_3 i_4} \xi_{i_4 i_1}$.

One can view this sum graphically, as a sum over length four cycles in
the vertex set $\{1, \ldots, n\}$; note that the four edges
$\{i_1, i_2\}, \{i_2, i_3\}, \{i_3, i_4\}, \{i_4, i_1\}$ are allowed to
be degenerate if two adjacent indices $i_j$ are equal. The value of each
term

(2.69)    $\mathbf{E} \xi_{i_1 i_2} \xi_{i_2 i_3} \xi_{i_3 i_4} \xi_{i_4 i_1}$

in this sum depends on what the cycle does.

$^{18}$ There is of course a dependence between the upper triangular and
lower triangular entries, but this is easy to deal with by folding the
sum into twice the upper triangular portion (plus the diagonal portion,
which is lower order).

Firstly, there is the case when all the four edges
$\{i_1, i_2\}, \{i_2, i_3\}, \{i_3, i_4\}, \{i_4, i_1\}$ are distinct.
Then the four factors $\xi_{i_1 i_2}, \ldots, \xi_{i_4 i_1}$ are
independent; since we are assuming them to have mean zero, the term
(2.69) vanishes. Indeed, the same argument shows that the only terms
that do not vanish are those in which each edge is repeated at least
twice. A short combinatorial case check then shows that, up to cyclic
permutations of the $i_1, i_2, i_3, i_4$ indices, there are now only a
few types of cycles in which the term (2.69) does not automatically
vanish:

(i) $i_1 = i_3$, but $i_2, i_4$ are distinct from each other and from
$i_1$.

(ii) $i_1 = i_3$ and $i_2 = i_4$.

(iii) $i_1 = i_2 = i_3$, but $i_4$ is distinct from $i_1$.

(iv) $i_1 = i_2 = i_3 = i_4$.

In the first case, the independence and unit variance assumptions
tell us that (2.69) is 1, and there are $O(n^3)$ such terms, so the
total contribution here to $\mathbf{E}\,\mathrm{tr}(M^4)$ is at most
$O(n^3)$. In the second case, the unit variance and the bound by $K$
tell us that the term is $O(K^2)$, and there are $O(n^2)$ such terms, so
the contribution here is $O(n^2 K^2)$. Similarly, the contribution of
the third type of cycle is $O(n^2)$, and the fourth type of cycle is
$O(n K^2)$, so we can put it all together to get

    $\mathbf{E}\,\mathrm{tr}(M^4) \le O(n^3) + O(n^2 K^2)$.

In particular, if we make the hypothesis $K = O(\sqrt{n})$, then we have

    $\mathbf{E}\,\mathrm{tr}(M^4) \le O(n^3)$,

and thus by Markov's inequality (1.13) we see that for any
$\varepsilon > 0$, $\mathrm{tr}(M^4) \le O_\varepsilon(n^3)$ with
probability at least $1 - \varepsilon$. Applying (2.66), this leads to
the upper bound

    $\|M\|_{op} \le O_\varepsilon(n^{3/4})$

with probability at least $1 - \varepsilon$; a similar argument shows
that for any fixed $\varepsilon > 0$, one has

    $\|M\|_{op} \le n^{3/4 + \varepsilon}$

with high probability. This is better than the upper bound obtained
from the second moment method, but still non-optimal.
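
For the symmetric Bernoulli ensemble the cycle count can be carried out exactly by brute force, since every entry satisfies $\xi_{ij}^2 = 1$: the expectation (2.69) equals 1 when every edge of the cycle is traversed an even number of times, and 0 otherwise. The sketch below (an illustration; the function name `count_even_cycles` is hypothetical) enumerates all cycles for small $n$. For $k = 4$ the count works out to $2n^3 - n^2$, consistent with the $O(n^3)$ bound above.

```python
import itertools
from collections import Counter


def count_even_cycles(n, k):
    """Count closed walks i_1 -> i_2 -> ... -> i_k -> i_1 on {0, ..., n-1}
    in which every undirected edge {i_a, i_(a+1)} (loops allowed) is
    traversed an even number of times.

    For a symmetric random sign matrix this count equals E tr(M^k).
    """
    count = 0
    for walk in itertools.product(range(n), repeat=k):
        edges = Counter(frozenset((walk[a], walk[(a + 1) % k]))
                        for a in range(k))
        if all(mult % 2 == 0 for mult in edges.values()):
            count += 1
    return count


for n in range(1, 6):
    print(n, count_even_cycles(n, 2), count_even_cycles(n, 4))
```

The same function can be called with $k = 6$ for small $n$ to explore the sixth moment, at the cost of enumerating $n^6$ cycles.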

Exercise 2.3.7. If $K = o(\sqrt{n})$, use the above argument to show
that

    $(\mathbf{E}\|M\|_{op}^4)^{1/4} \ge (2^{1/4} + o(1)) \sqrt{n}$

which in some sense improves upon (2.68) by a factor of $2^{1/4}$. In
particular, if $K = O(1)$, conclude that the median of $\|M\|_{op}$ is
at least $(2^{1/4} + o(1))\sqrt{n}$.

Now let us take a quick look at the sixth moment, again with
the running assumption of a Wigner matrix in which all entries have
mean zero, unit variance, and are bounded in magnitude by $K$. We have

    $\mathbf{E}\,\mathrm{tr}(M^6) = \sum_{1 \le i_1, \ldots, i_6 \le n} \mathbf{E} \xi_{i_1 i_2} \cdots \xi_{i_5 i_6} \xi_{i_6 i_1}$,

a sum over cycles of length 6 in $\{1, \ldots, n\}$. Again, most of the
summands here vanish; the only ones which do not are those cycles in
which each edge occurs at least twice (so in particular, there are at
most three distinct edges).

Classifying all the types of cycles that could occur here is somewhat
tedious, but it is clear that there are going to be $O(1)$ different
types of cycles. But we can organise things by the multiplicity of each
edge, leaving us with four classes of cycles to deal with:

(i) Cycles in which there are three distinct edges, each occurring
two times.

(ii) Cycles in which there are two distinct edges, one occurring
twice and one occurring four times.

(iii) Cycles in which there are two distinct edges, each occurring
three times$^{19}$.

(iv) Cycles in which a single edge occurs six times.

It is not hard to see that summands coming from the first type of cycle give a contribution of 1, and there are $O(n^4)$ of these (because such cycles span at most four vertices). Similarly, the second and third types of cycles give a contribution of $O(K^2)$ per summand, and there are $O(n^3)$ summands; finally, the fourth type of cycle gives a contribution of $O(K^4)$, with $O(n^2)$ summands. Putting this together we see that

$$\mathbf{E} \operatorname{tr}(M^6) \le O(n^4) + O(n^3 K^2) + O(n^2 K^4);$$

so in particular if we assume $K = O(\sqrt{n})$ as before, we have

$$\mathbf{E} \operatorname{tr}(M^6) \le O(n^4)$$

and if we then use (2.66) as before we see that

$$\|M\|_{op} \le O_\varepsilon(n^{2/3})$$

with probability $1 - \varepsilon$, for any $\varepsilon > 0$; so we are continuing to make progress towards what we suspect (from the epsilon net argument) to be the correct bound of $n^{1/2}$.
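The classification by edge multiplicity can be checked by brute force for small $n$. The following is a minimal sketch, not part of the original argument: the function names `multiplicity_profile` and `classify_cycles` are illustrative, vertices are labelled $0, \ldots, n-1$, and an edge is treated as an unordered pair (so a diagonal entry appears as a one-element set).

```python
from collections import Counter
from itertools import product


def multiplicity_profile(cycle):
    """Sorted tuple of edge multiplicities of the closed walk
    cycle[0] -> cycle[1] -> ... -> cycle[-1] -> cycle[0]."""
    k = len(cycle)
    edges = Counter(
        frozenset((cycle[j], cycle[(j + 1) % k])) for j in range(k)
    )
    return tuple(sorted(edges.values()))


def classify_cycles(n, k):
    """Count the cycles of length k in {0, ..., n-1} in which every edge
    occurs at least twice, grouped by multiplicity profile."""
    counts = Counter()
    for cycle in product(range(n), repeat=k):
        profile = multiplicity_profile(cycle)
        if profile[0] >= 2:  # smallest multiplicity is at least two
            counts[profile] += 1
    return counts
```

Calling `classify_cycles(4, 6)` groups the surviving summands into the profiles `(2, 2, 2)`, `(2, 4)` and `(6,)`; the profile `(3, 3)` never occurs, since a closed walk supported on two distinct edges must traverse any non-loop edge an even number of times.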

Exercise 2.3.8. If $K = o(\sqrt{n})$, use the above argument to show that

$$(\mathbf{E}\|M\|_{op}^6)^{1/6} \ge (5^{1/6} + o(1))\sqrt{n}.$$

In particular, if $K = O(1)$, conclude that the median of $\|M\|_{op}$ is at least $(5^{1/6} + o(1))\sqrt{n}$. Thus this is a (slight) improvement over Exercise 2.3.7.

$^{19}$ Actually, this case ends up being impossible, due to a "bridges of Königsberg" type of obstruction, but we will retain it for this discussion.

Let us now consider the general $k$-th moment computation under the same hypotheses as before, with $k$ an even integer, and make some modest attempt to track the dependency of the constants on $k$. Again, we have

(2.70)  $$\mathbf{E} \operatorname{tr}(M^k) = \sum_{1 \le i_1, \ldots, i_k \le n} \mathbf{E} \xi_{i_1 i_2} \cdots \xi_{i_k i_1},$$

which is a sum over cycles of length $k$. Again, the only non-vanishing expectations are those for which each edge occurs at least twice; in particular, there are at most $k/2$ edges, and thus at most $k/2 + 1$ vertices.

We divide the cycles into various classes, depending on which edges are equal to each other. (More formally, a class is an equivalence relation $\sim$ on a set of $k$ labels, say $\{1, \ldots, k\}$, in which each equivalence class contains at least two elements, and a cycle of $k$ edges $\{i_1, i_2\}, \ldots, \{i_k, i_1\}$ lies in the class associated to $\sim$ when we have that $\{i_j, i_{j+1}\} = \{i_{j'}, i_{j'+1}\}$ iff $j \sim j'$, where we adopt the cyclic notation $i_{k+1} := i_1$.)

How many different classes could there be? We have to assign up to $k/2$ labels to $k$ edges, so a crude upper bound here is $(k/2)^k$.

Now consider a given class of cycle. It has $j$ edges $e_1, \ldots, e_j$ for some $1 \le j \le k/2$, with multiplicities $a_1, \ldots, a_j$, where $a_1, \ldots, a_j$ are at least 2 and add up to $k$. The $j$ edges span at most $j + 1$ vertices; indeed, in addition to the first vertex $i_1$, one can specify all the other vertices by looking at the first appearance of each of the $j$ edges $e_1, \ldots, e_j$ in the path from $i_1$ to $i_k$, and recording the final vertex of each such edge. From this, we see that the total number of cycles in this particular class is at most $n^{j+1}$. On the other hand, because each $\xi_{ij}$ has mean zero, unit variance and is bounded by $K$, the $a$-th moment of this coefficient is at most $K^{a-2}$ for any $a \ge 2$. Thus each summand in (2.70) coming from a cycle in this class has magnitude at most

$$K^{a_1 - 2} \cdots K^{a_j - 2} = K^{a_1 + \cdots + a_j - 2j} = K^{k - 2j}.$$

Thus the total contribution of this class to (2.70) is at most $n^{j+1} K^{k-2j}$, which we can upper bound by

$$\max(n^{k/2+1}, n^2 K^{k-2}) = n^{k/2+1} \max(1, K/\sqrt{n})^{k-2}.$$
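The step from $n^{j+1} K^{k-2j}$ to the maximum can be spelled out as follows; this is only a rewriting of the bound above, using $1 \le j \le k/2$ and no further assumptions.

```latex
\begin{align*}
n^{j+1} K^{k-2j}
  &= n^{k/2+1} \cdot n^{-(k-2j)/2} K^{k-2j}
   = n^{k/2+1} \left( \frac{K}{\sqrt{n}} \right)^{k-2j}, \\
\left( \frac{K}{\sqrt{n}} \right)^{k-2j}
  &\le \max\left( 1, \frac{K}{\sqrt{n}} \right)^{k-2}
  \qquad \text{since } 0 \le k - 2j \le k - 2 .
\end{align*}
```

The two extreme cases $j = k/2$ and $j = 1$ give exactly the two terms $n^{k/2+1}$ and $n^2 K^{k-2}$ inside the maximum.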

Summing up over all classes, we obtain the (somewhat crude) bound

$$\mathbf{E} \operatorname{tr}(M^k) \le (k/2)^k n^{k/2+1} \max(1, K/\sqrt{n})^{k-2}$$

and thus by (2.66)

$$\mathbf{E} \|M\|_{op}^k \le (k/2)^k n^{k/2+1} \max(1, K/\sqrt{n})^{k-2}$$

and so by Markov's inequality (1.13) we have

$$\mathbf{P}(\|M\|_{op} \ge \lambda) \le \lambda^{-k} (k/2)^k n^{k/2+1} \max(1, K/\sqrt{n})^{k-2}$$

for all $\lambda > 0$. This, for instance, places the median of $\|M\|_{op}$ at $O(n^{1/k} k \sqrt{n} \max(1, K/\sqrt{n}))$. We can optimise this in $k$ by choosing $k$ to be comparable to $\log n$, and so we obtain an upper bound of $O(\sqrt{n} \log n \max(1, K/\sqrt{n}))$ for the median; indeed, a slight tweaking of the constants tells us that $\|M\|_{op} = O(\sqrt{n} \log n \max(1, K/\sqrt{n}))$ with high probability.
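The choice $k \sim \log n$ can be seen numerically. The sketch below is illustrative only: `median_bound` solves for the value of $\lambda$ at which the right-hand side of the Markov bound equals $1/2$, and `best_even_k` searches over even $k$ up to an arbitrary cutoff `k_max`; both names and the cutoff are assumptions made here, not notation from the text.

```python
import math


def median_bound(n, k, K=1.0):
    """Value of lambda at which
    lambda^{-k} (k/2)^k n^{k/2+1} max(1, K/sqrt(n))^{k-2} equals 1/2."""
    ratio = max(1.0, K / math.sqrt(n))
    log_lambda = (
        math.log(2.0)
        + k * math.log(k / 2)
        + (k / 2 + 1) * math.log(n)
        + (k - 2) * math.log(ratio)
    ) / k
    return math.exp(log_lambda)


def best_even_k(n, K=1.0, k_max=200):
    """Even k in [2, k_max] minimising the resulting bound on the median."""
    return min(range(2, k_max + 1, 2), key=lambda k: median_bound(n, k, K))
```

For bounded $K$ the minimiser returned by `best_even_k(n)` stays within a bounded distance of $\log(2n)$, which is where the factor $n^{1/k}$ becomes $O(1)$ while the factor $k$ has only grown to size $\log n$.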

The same argument works if the entries have at most unit variance rather than unit variance, thus we have shown

Proposition 2.3.13 (Weak upper bound). Let $M$ be a random Hermitian matrix, with the upper triangular entries $\xi_{ij}$, $i \le j$ being independent with mean zero and variance at most 1, and bounded in magnitude by $K$. Then $\|M\|_{op} = O(\sqrt{n} \log n \max(1, K/\sqrt{n}))$ with high probability.

When $K \le \sqrt{n}$, this gives an upper bound of $O(\sqrt{n} \log n)$, which is still off by a logarithmic factor from the expected bound of $O(\sqrt{n})$. We will remove this logarithmic loss later in this section.

2.3.5. Computing the moment to top order. Now let us consider the case when $K = o(\sqrt{n})$, and each entry has variance exactly 1. We have an upper bound

$$\mathbf{E} \operatorname{tr}(M^k) \le (k/2)^k n^{k/2+1};$$

let us try to get a more precise answer here (as in Exercises 2.3.7, 2.3.8). Recall that each class of cycle contributed a bound of $n^{j+1} K^{k-2j}$ to this expression. If $K = o(\sqrt{n})$, we see that such expressions are $o_k(n^{k/2+1})$ whenever $j < k/2$, where the $o_k()$ notation means that the decay rate as $n \to \infty$ can depend on $k$. So the total contribution of all such classes is $o_k(n^{k/2+1})$.

Now we consider the remaining classes with $j = k/2$. For such classes, each equivalence class of edges contains exactly two representatives, thus each edge is repeated exactly once. The contribution of each such cycle to (2.70) is exactly 1, thanks to the unit variance and independence hypothesis. Thus, the total contribution of these classes to $\mathbf{E} \operatorname{tr}(M^k)$ is equal to a purely combinatorial quantity, namely the number of cycles of length $k$ on $\{1, \ldots, n\}$ in which each edge is repeated exactly once, yielding $k/2$ unique edges. We are thus faced with the enumerative combinatorics problem of bounding this quantity as precisely as possible.

With $k/2$ edges, there are at most $k/2 + 1$ vertices traversed by the cycle. If there are fewer than $k/2 + 1$ vertices traversed, then there are at most $O_k(n^{k/2}) = o_k(n^{k/2+1})$ cycles of this type, since one can specify such cycles by identifying up to $k/2$ vertices in $\{1, \ldots, n\}$ and then matching those coordinates with the $k$ vertices of the cycle. So we set aside these cycles, and only consider those cycles which traverse exactly $k/2 + 1$ vertices. Let us call such cycles (i.e. cycles of length $k$ with each edge repeated exactly once, and traversing exactly $k/2 + 1$ vertices) non-crossing cycles of length $k$ in $\{1, \ldots, n\}$. Our remaining task is then to count the number of non-crossing cycles.

Example 2.3.14. Let $a, b, c, d$ be distinct elements of $\{1, \ldots, n\}$. Then $(i_1, \ldots, i_6) = (a, b, c, d, c, b)$ is a non-crossing cycle of length 6, as is $(a, b, a, c, a, d)$. Any cyclic permutation of a non-crossing cycle is again a non-crossing cycle.

Exercise 2.3.9. Show that a cycle of length $k$ is non-crossing if and only if there exists a tree$^{20}$ in $\{1, \ldots, n\}$ of $k/2$ edges and $k/2 + 1$ vertices, such that the cycle lies in the tree and traverses each edge in the tree exactly twice.

Exercise 2.3.10. Let $i_1, \ldots, i_k$ be a cycle of length $k$. Arrange the integers $1, \ldots, k$ around a circle, and draw a line segment between two distinct integers $1 \le a < b \le k$ whenever $i_a = i_b$. Show that the cycle is non-crossing if and only if the number of line segments is exactly $k/2 - 1$, and the line segments do not cross each other. This may help explain the terminology "non-crossing".

$^{20}$ In graph theory, a tree is a finite collection of vertices and (undirected) edges between vertices, which do not contain any cycles.

Now we can complete the count. If $k$ is a positive even integer, define a Dyck word$^{21}$ of length $k$ to be a word of length $k$ consisting of equally many left and right parentheses (, ), such that when one reads from left to right, there are always at least as many left parentheses as right parentheses (or in other words, the parentheses define a valid nesting). For instance, the only Dyck word of length 2 is (), the two Dyck words of length 4 are (()) and ()(), and the five Dyck words of length 6 are

()()(), (())(), ()(()), (()()), ((())),

and so forth.

$^{21}$ Dyck words are also closely related to Dyck paths in enumerative combinatorics.
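The small cases listed above can be reproduced mechanically. The following sketch (the function name `dyck_words` is chosen here for illustration) builds words one parenthesis at a time, appending a right parenthesis only when it has an unmatched left parenthesis to close.

```python
def dyck_words(k):
    """Return the list of all Dyck words of even length k."""
    if k % 2:
        raise ValueError("k must be even")
    words = []

    def extend(prefix, opened, closed):
        if len(prefix) == k:
            words.append(prefix)
            return
        if opened < k // 2:   # still allowed to open a parenthesis
            extend(prefix + "(", opened + 1, closed)
        if closed < opened:   # only close a parenthesis that is open
            extend(prefix + ")", opened, closed + 1)

    extend("", 0, 0)
    return words
```

For $k = 2, 4, 6$ this returns 1, 2 and 5 words respectively, matching the lists in the text.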

Lemma 2.3.15. The number of non-crossing cycles of length $k$ in $\{1, \ldots, n\}$ is equal to $C_{k/2}\, n(n-1) \cdots (n - k/2)$, where $C_{k/2}$ is the number of Dyck words of length $k$. (The number $C_{k/2}$ is also known as the $(k/2)$-th Catalan number.)

Proof. We will give a bijective proof. Namely, we will find a way to store a non-crossing cycle as a Dyck word, together with an (ordered) sequence of $k/2 + 1$ distinct elements from $\{1, \ldots, n\}$, in such a way that any such pair of a Dyck word and ordered sequence generates exactly one non-crossing cycle. This will clearly give the claim.

So, let us take a non-crossing cycle $i_1, \ldots, i_k$. We imagine traversing this cycle from $i_1$ to $i_2$, then from $i_2$ to $i_3$, and so forth until we finally return to $i_1$ from $i_k$. On each leg of this journey, say from $i_j$ to $i_{j+1}$, we either use an edge that we have not seen before, or else we are using an edge for the second time. Let us say that the leg from $i_j$ to $i_{j+1}$ is an innovative leg if it is in the first category, and a returning leg otherwise. Thus there are $k/2$ innovative legs and $k/2$ returning legs. Clearly, it is only the innovative legs that can bring us to vertices that we have not seen before. Since we have to visit $k/2 + 1$ distinct vertices (including the vertex $i_1$ we start at), we conclude that each innovative leg must take us to a new vertex. We thus record, in order, each of the new vertices we visit, starting at $i_1$ and adding another vertex for each innovative leg; this is an ordered sequence of $k/2 + 1$ distinct elements of $\{1, \ldots, n\}$. Next, traversing the cycle again, we write down a ( whenever we traverse an innovative leg, and an ) otherwise. This is clearly a Dyck word. For instance, using the examples in Example 2.3.14, the non-crossing cycle $(a, b, c, d, c, b)$ gives us the ordered sequence $(a, b, c, d)$ and the Dyck word ((())), while $(a, b, a, c, a, d)$ gives us the ordered sequence $(a, b, c, d)$ and the Dyck word ()()().

We have seen that every non-crossing cycle gives rise to an ordered sequence and a Dyck word. A little thought shows that the cycle can be uniquely reconstructed from this ordered sequence and Dyck word (the key point being that whenever one is performing a returning leg from a vertex $v$, one is forced to return along the unique innovative leg that discovered $v$). A slight variant of this thought also shows that every Dyck word of length $k$ and ordered sequence of $k/2 + 1$ distinct elements gives rise to a non-crossing cycle. This gives the required bijection, and the claim follows. □
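The bijection in the proof can be sketched in code. The names `encode`, `decode` and `non_crossing_cycles` are illustrative and not from the text; `encode` assumes its input really is a non-crossing cycle, and `decode` keeps a stack of the vertices on the current path from $i_1$, so that a returning leg simply backtracks along the innovative leg that discovered the current vertex.

```python
from collections import Counter
from itertools import product


def encode(cycle):
    """Split a non-crossing cycle into (ordered new vertices, Dyck word)."""
    k = len(cycle)
    seen_edges = set()
    vertices = [cycle[0]]
    word = []
    for j in range(k):
        u, v = cycle[j], cycle[(j + 1) % k]
        edge = frozenset((u, v))
        if edge not in seen_edges:   # innovative leg: new edge, new vertex
            seen_edges.add(edge)
            vertices.append(v)
            word.append("(")
        else:                        # returning leg: edge used a second time
            word.append(")")
    return vertices, "".join(word)


def decode(vertices, word):
    """Rebuild the cycle from the ordered vertex list and the Dyck word."""
    fresh = iter(vertices[1:])
    path = [vertices[0]]             # stack: path from i_1 to current vertex
    cycle = []
    for symbol in word:
        cycle.append(path[-1])
        if symbol == "(":
            path.append(next(fresh))  # discover the next recorded vertex
        else:
            path.pop()                # forced return along the discovering leg
    return cycle


def non_crossing_cycles(n, k):
    """Brute-force list of non-crossing cycles of length k in {0, ..., n-1}."""
    result = []
    for cycle in product(range(n), repeat=k):
        edges = Counter(
            frozenset((cycle[j], cycle[(j + 1) % k])) for j in range(k)
        )
        if all(m == 2 for m in edges.values()) and len(set(cycle)) == k // 2 + 1:
            result.append(cycle)
    return result
```

On the two cycles of Example 2.3.14 this reproduces the pairs given in the proof, and for small parameters the brute-force count agrees with the lemma: for instance $n = 4$, $k = 4$ gives $C_2 \cdot 4 \cdot 3 \cdot 2 = 48$ cycles.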

Next, we recall the classical formula for the Catalan number:

Exercise 2.3.11. Establish the recurrence

$$C_{n+1} = \sum_{i=0}^{n} C_i C_{n-i}$$

for any $n \ge 1$ (with the convention $C_0 = 1$), and use this to deduce that

(2.71)  $$C_{k/2} = \frac{k!}{(k/2+1)!\,(k/2)!}$$

for all $k = 2, 4, 6, \ldots$.

Exercise 2.3.12. Let $k$ be a positive even integer. Given a string of $k/2$ left parentheses and $k/2$ right parentheses which is not a Dyck word, define the reflection of this string by taking the first right parenthesis which does not have a matching left parenthesis, and then reversing all the parentheses after that right parenthesis. Thus, for instance, the reflection of ())(() is ())))(. Show that there is a bijection between non-Dyck words with $k/2$ left parentheses and $k/2$ right parentheses, and arbitrary words with $k/2 - 1$ left parentheses and $k/2 + 1$ right parentheses. Use this to give an alternate proof of (2.71).
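Without giving away the exercises, one can at least check the recurrence, the closed form (2.71) and the reflection map numerically. This is a sketch under the convention $C_0 = 1$; the function names are chosen here for illustration.

```python
from math import factorial


def catalan_by_recurrence(m):
    """C_m from C_{n+1} = sum_{i=0}^{n} C_i C_{n-i}, starting at C_0 = 1."""
    c = [1]
    for n in range(m):
        c.append(sum(c[i] * c[n - i] for i in range(n + 1)))
    return c[m]


def catalan_closed_form(m):
    """C_m = k! / ((k/2 + 1)! (k/2)!) with k = 2m, as in (2.71)."""
    k = 2 * m
    return factorial(k) // (factorial(m + 1) * factorial(m))


def reflect(word):
    """Flip every parenthesis after the first unmatched right parenthesis."""
    depth = 0
    for pos, symbol in enumerate(word):
        depth += 1 if symbol == "(" else -1
        if depth < 0:  # this right parenthesis has no matching left one
            tail = "".join(")" if s == "(" else "(" for s in word[pos + 1:])
            return word[:pos + 1] + tail
    raise ValueError("a Dyck word has no unmatched right parenthesis")
```

Here `reflect("())(()")` reproduces the example in Exercise 2.3.12, and the two Catalan functions agree on every value tested.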



Note that $n(n-1) \cdots (n - k/2) = (1 + o_k(1)) n^{k/2+1}$. Putting all the above computations together, we conclude

Theorem 2.3.16 (Moment computation). Let $M$ be a real symmetric random matrix, with the upper triangular elements $\xi_{ij}$, $i \le j$ jointly independent with mean zero and variance one, and bounded in magnitude by $o(\sqrt{n})$. Let $k$ be a positive even integer. Then we have

$$\mathbf{E} \operatorname{tr}(M^k) = (C_{k/2} + o_k(1)) n^{k/2+1}$$

where $C_{k/2}$ is given by (2.71).

Remark 2.3.17. An inspection of the proof also shows that if we allow the $\xi_{ij}$ to have variance at most one, rather than equal to one, we obtain the upper bound

$$\mathbf{E} \operatorname{tr}(M^k) \le (C_{k/2} + o_k(1)) n^{k/2+1}.$$

Exercise 2.3.13. Show that Theorem 2.3.16 also holds for Hermitian random matrices. (Hint: The main point is that with non-crossing cycles, each non-innovative leg goes in the reverse direction to the corresponding innovative leg - why?)

Remark 2.3.18. Theorem 2.3.16 can be compared with the formula

$$\mathbf{E} S_n^k = (C'_{k/2} + o_k(1)) n^{k/2}$$

derived in Notes 1, where $S_n = X_1 + \cdots + X_n$ is the sum of $n$ iid random variables of mean zero and variance one, and

$$C'_{k/2} := \frac{k!}{2^{k/2} (k/2)!}.$$

Exercise 2.3.10 shows that $C_{k/2}$ can be interpreted as the number of ways to join $k$ points on the circle by $k/2 - 1$ non-crossing chords. In a similar vein, $C'_{k/2}$ can be interpreted as the number of ways to join $k$ points on the circle by $k/2$ chords which are allowed to cross each other (except at the endpoints). Thus moments of Wigner-type matrices are in some sense the "non-crossing" version of moments of sums of random variables. We will discuss this phenomenon more when we turn to free probability in Section 2.5.
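The contrast between the two constants can be illustrated by enumeration. The sketch below assumes the perfect-matching model, in which $k$ points on a circle are paired up by $k/2$ chords with each point used exactly once; this is the standard way to see both numbers side by side, and is not literally the chord diagram of Exercise 2.3.10. The function names are illustrative.

```python
def perfect_matchings(points):
    """Yield every way of pairing up the list `points` (of even length)."""
    if not points:
        yield []
        return
    first, rest = points[0], points[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for matching in perfect_matchings(remaining):
            yield [(first, partner)] + matching


def crosses(chord1, chord2):
    """True if two chords with distinct endpoints cross inside the circle."""
    a, b = sorted(chord1)
    c, d = sorted(chord2)
    return a < c < b < d or c < a < d < b


def count_matchings(k):
    """Return (number of all matchings, number of non-crossing matchings)."""
    total = non_crossing = 0
    for matching in perfect_matchings(list(range(k))):
        total += 1
        if not any(
            crosses(p, q)
            for i, p in enumerate(matching)
            for q in matching[i + 1:]
        ):
            non_crossing += 1
    return total, non_crossing
```

Under this model `count_matchings(6)` returns `(15, 5)`: the first entry is $C'_3 = 6!/(2^3 \cdot 3!)$ and the second is the Catalan number $C_3$.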

Combining Theorem 2.3.16 with (2.66) we obtain a lower bound

$$\mathbf{E} \|M\|_{op}^k \ge (C_{k/2} + o_k(1)) n^{k/2}.$$

In the bounded case $K = O(1)$, we can combine this with Exercise 2.3.5 to conclude that the median (or mean) of $\|M\|_{op}$ is at least $(C_{k/2}^{1/k} + o_k(1))\sqrt{n}$. On the other hand, from Stirling's formula (Section 1.2) we see that $C_{k/2}^{1/k}$ converges to 2 as $k \to \infty$. Taking $k$ to be a slowly growing function of $n$, we conclude

Proposition 2.3.19 (Lower Bai-Yin theorem). Let $M$ be a real symmetric random matrix, with the upper triangular elements $\xi_{ij}$, $i \le j$ jointly independent with mean zero and variance one, and bounded in magnitude by $O(1)$. Then the median (or mean) of $\|M\|_{op}$ is at least $(2 - o(1))\sqrt{n}$.
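The convergence of $C_{k/2}^{1/k}$ to 2 is slow but easy to observe. A minimal sketch (the function name `catalan_root` is illustrative), using the standard identity $C_m = \binom{2m}{m}/(m+1)$ and logarithms to avoid overflow:

```python
import math


def catalan_root(k):
    """Return C_{k/2}^{1/k} for a positive even integer k."""
    m = k // 2
    catalan = math.comb(2 * m, m) // (m + 1)  # exact integer C_m
    return math.exp(math.log(catalan) / k)
```

Evaluating this at $k = 10, 100, 1000, 10000$ gives an increasing sequence that stays below 2, consistent with the polynomial correction in Stirling's formula becoming negligible only after taking $k$-th roots.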

Remark 2.3.20. One can in fact obtain an exact asymptotic expansion of the moments $\mathbf{E} \operatorname{tr}(M^k)$ as a polynomial in $n$, known as the genus expansion of the moments. This expansion is however somewhat difficult to work with from a combinatorial perspective (except at top order) and will not be used here.

2.3.6. Removing the logarithm. The upper bound in Proposition 2.3.13 loses a logarithm in comparison to the lower bound coming from Theorem 2.3.16. We now discuss how to remove this logarithm.

Suppose that we could eliminate the $o_k(1)$ error in Theorem 2.3.16. Then from (2.66) we would have

$$\mathbf{E} \|M\|_{op}^k \le C_{k/2} n^{k/2+1}$$

and hence by Markov's inequality (1.13)

$$\mathbf{P}(\|M\|_{op} > \lambda) \le \lambda^{-k} C_{k/2} n^{k/2+1}.$$

Applying this with $\lambda = (2 + \varepsilon)\sqrt{n}$ for some fixed $\varepsilon > 0$, and setting $k$ to be a large multiple of $\log n$, we see that $\|M\|_{op} \le (2 + O(\varepsilon))\sqrt{n}$ asymptotically almost surely, which on selecting $\varepsilon$ to decay slowly to zero in $n$ gives in fact that $\|M\|_{op} \le (2 + o(1))\sqrt{n}$ asymptotically almost surely, thus complementing the lower bound in Proposition 2.3.19.
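To see why a large multiple of $\log n$ suffices in this heuristic, one can insert the crude bound $C_{k/2} \le 2^k$ (there are at most $2^k$ words of length $k$ in two symbols). The constant $A$ below is introduced only for this illustration.

```latex
\begin{align*}
\mathbf{P}\bigl( \|M\|_{op} > (2+\varepsilon)\sqrt{n} \bigr)
  &\le \bigl( (2+\varepsilon)\sqrt{n} \bigr)^{-k} \, 2^k \, n^{k/2+1}
   = n \left( \frac{2}{2+\varepsilon} \right)^{k}, \\
\text{and with } k = A \log n: \qquad
n \left( \frac{2}{2+\varepsilon} \right)^{A \log n}
  &= n^{\,1 - A \log(1 + \varepsilon/2)} \to 0
  \quad \text{as soon as } A > \frac{1}{\log(1+\varepsilon/2)} .
\end{align*}
```

The required multiple $A$ grows like $2/\varepsilon$ as $\varepsilon \to 0$, which is why $k$ must be allowed to be somewhat larger than $\log n$ when $\varepsilon$ is sent to zero.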

This argument was not rigorous because it did not address the $o_k(1)$ error. Without a more quantitative accounting of this error, one cannot set $k$ as large as $\log n$ without losing control of the error terms; and indeed, a crude accounting of this nature will lose factors of $k^k$ which are unacceptable. Nevertheless, by tightening the hypotheses a little bit and arguing more carefully, we can get a good bound, for $k$ in the region of interest:

Theorem 2.3.21 (Improved moment bound). Let $M$ be a real symmetric random matrix, with the upper triangular elements $\xi_{ij}$, $i \le j$ jointly independent with mean zero and variance one, and bounded in magnitude by $O(n^{0.49})$ (say). Let $k$ be a positive even integer of size $k = O(\log^2 n)$ (say). Then we have

$$\mathbf{E} \operatorname{tr}(M^k) = C_{k/2} n^{k/2+1} + O(k^{O(1)} 2^k n^{k/2+0.98})$$

where $C_{k/2}$ is given by (2.71). In particular, from the trivial bound $C_{k/2} \le 2^k$ (which is obvious from the Dyck words definition) one has

(2.72)  $$\mathbf{E} \operatorname{tr}(M^k) \le (2 + o(1))^k n^{k/2+1}.$$

One can of course adjust the parameters $n^{0.49}$ and $\log^2 n$ in the above theorem, but we have tailored these parameters for our application to simplify the exposition slightly.

Proof. We may assume $n$ large, as the claim is vacuous for bounded $n$.

We again expand using (2.70), and discard all the cycles in which there is an edge that only appears once. The contribution of the non-crossing cycles was already computed in the previous section to be

$$C_{k/2}\, n(n-1) \cdots (n - k/2),$$

which can easily be computed (e.g. by taking logarithms, or using Stirling's formula) to be $(C_{k/2} + o(1)) n^{k/2+1}$. So the only task is to show that the net contribution of the remaining cycles is $O(k^{O(1)} 2^k n^{k/2+0.98})$.

Consider one of these cycles $(i_1, \ldots, i_k)$; it has $j$ distinct edges for some $1 \le j \le k/2$ (with each edge repeated at least once).

We order the $j$ distinct edges $e_1, \ldots, e_j$ by their first appearance in the cycle. Let $a_1, \ldots, a_j$ be the multiplicities of these edges, thus the $a_1, \ldots, a_j$ are all at least 2 and add up to $k$. Observe from the moment hypotheses that the moment $\mathbf{E}|\xi_{ij}|^a$ is bounded by $O(n^{0.49})^{a-2}$ for $a \ge 2$. Since $a_1 + \cdots + a_j = k$, we conclude that the expression

$$\mathbf{E} \xi_{i_1 i_2} \cdots \xi_{i_k i_1}$$

in (2.70) has magnitude at most $O(n^{0.49})^{k-2j}$, and so the net contribution of the cycles that are not non-crossing is bounded in magnitude by

(2.73)  $$\sum_{j=1}^{k/2} O(n^{0.49})^{k-2j} \sum_{a_1, \ldots, a_j} N_{a_1, \ldots, a_j}$$

where $a_1, \ldots, a_j$ range over integers that are at least 2 and which add up to $k$, and $N_{a_1, \ldots, a_j}$ is the number of cycles that are not non-crossing and have $j$ distinct edges with multiplicity $a_1, \ldots, a_j$ (in order of appearance). It thus suffices to show that (2.73) is $O(k^{O(1)} 2^k n^{k/2+0.98})$.

Next, we estimate $N_{a_1, \ldots, a_j}$ for a fixed $a_1, \ldots, a_j$. Given a cycle $(i_1, \ldots, i_k)$, we traverse its $k$ legs (which each traverse one of the edges $e_1, \ldots, e_j$) one at a time and classify them into various categories:

(i) High-multiplicity legs, which use an edge $e_i$ whose multiplicity $a_i$ is larger than two.

(ii) Fresh legs, which use an edge $e_i$ with $a_i = 2$ for the first time.

(iii) Return legs, which use an edge $e_i$ with $a_i = 2$ that has already been traversed by a previous fresh leg.

We also subdivide fresh legs into innovative legs, which take one to a vertex one has not visited before, and non-innovative legs, which take one to a vertex that one has visited before.

At any given point in time when traversing this cycle, we define an available edge to be an edge $e_i$ of multiplicity $a_i = 2$ that has already been traversed by its fresh leg, but not by its return leg. Thus, at any given point in time, one travels along either a high-multiplicity leg, a fresh leg (thus creating a new available edge), or one returns along an available edge (thus removing that edge from availability).

Call a return leg starting from a vertex $v$ forced if, at the time one is performing that leg, there is only one available edge from $v$, and unforced otherwise (i.e. there are two or more available edges to choose from).

We suppose that there are $l := \#\{1 \le i \le j : a_i > 2\}$ high-multiplicity edges among the $e_1, \ldots, e_j$, leading to $j - l$ fresh legs and their $j - l$ return leg counterparts. In particular, the total number of high-multiplicity legs is

(2.74)  $$\sum_{a_i > 2} a_i = k - 2(j - l).$$

Since $\sum_{a_i > 2} a_i \ge 3l$, we conclude the bound

(2.75)  $$l \le k - 2j.$$

We assume that there are $m$ non-innovative legs among the $j - l$ fresh legs, leaving $j - l - m$ innovative legs. As the cycle is not non-crossing, we either have $j < k/2$ or $m > 0$.

Similarly, we assume that there are $r$ unforced return legs among the $j - l$ total return legs. We have an important estimate:

Lemma 2.3.22 (Not too many unforced return legs). We have

$$r \le 2\Bigl(m + \sum_{a_i > 2} a_i\Bigr).$$

In particular, from (2.74), (2.75), we have

$$r \le O(k - 2j) + O(m).$$

Proof. Let $v$ be a vertex visited by the cycle which is not the initial vertex $i_1$. Then the very first arrival at $v$ comes from a fresh leg, which immediately becomes available. Each departure from $v$ may create another available edge from $v$, but each subsequent arrival at $v$ will delete an available leg from $v$, unless the arrival is along a non-innovative or high-multiplicity edge$^{22}$. Finally, any returning leg that departs from $v$ will also delete an available edge from $v$.

This has two consequences. Firstly, if there are no non-innovative or high-multiplicity edges arriving at $v$, then whenever one arrives at $v$, there is at most one available edge from $v$, and so every return leg from $v$ is forced. (And there will be only one such return leg.) If instead there are non-innovative or high-multiplicity edges arriving at $v$, then we see that the total number of return legs from $v$ is at most one plus the number of such edges. In both cases, we conclude that the number of unforced return legs from $v$ is bounded by twice the number of non-innovative or high-multiplicity edges arriving at $v$. Summing over $v$, one obtains the claim. □

$^{22}$ Note that one can loop from $v$ to itself and create an available edge, but this is along a non-innovative edge and so is not inconsistent with the previous statements.

Now we return to the task of counting $N_{a_1, \ldots, a_j}$, by recording various data associated to any given cycle $(i_1, \ldots, i_k)$ contributing to this number. First, fix $m, r$. We record the initial vertex $i_1$ of the cycle, for which there are $n$ possibilities. Next, for each high-multiplicity edge $e_i$ (in increasing order of $i$), we record all the $a_i$ locations in the cycle where this edge is used; the total number of ways this can occur for each such edge can be bounded above by $k^{a_i}$, so the total entropy cost here is $k^{\sum_{a_i > 2} a_i} = k^{k - 2(j - l)}$. We also record the final endpoint of the first occurrence of the edge $e_i$ for each such $i$; this list of $l$ vertices in $\{1, \ldots, n\}$ has at most $n^l$ possibilities.

For each innovative leg, we record the final endpoint of that leg, leading to an additional list of $j - l - m$ vertices with at most $n^{j - l - m}$ possibilities.

For each non-innovative leg, we record the position of that leg, leading to a list of $m$ numbers from $\{1, \ldots, k\}$, which has at most $k^m$ possibilities.

For each unforced return leg, we record the position of the corresponding fresh leg, leading to a list of $r$ numbers from $\{1, \ldots, k\}$, which has at most $k^r$ possibilities.

Finally, we record a Dyck-like word of length $k$, in which we place a ( whenever the leg is innovative, and ) otherwise (the brackets need not match here). The total entropy cost here can be bounded above by $2^k$.

We now observe that all this data (together with $l, m, r$) can be used to completely reconstruct the original cycle. Indeed, as one traverses the cycle, the data already tells us which edges are high-multiplicity, which ones are innovative, which ones are non-innovative, and which ones are return legs. In all edges in which one could possibly visit a new vertex, the location of that vertex has been recorded. For all unforced returns, the data tells us which fresh leg to backtrack upon to return to. Finally, for forced returns, there is only one available leg to backtrack to, and so one can reconstruct the entire cycle from this data.

As a consequence, for fixed $l$, $m$ and $r$, there are at most

$$n \, k^{k - 2(j - l)} \, n^l \, n^{j - l - m} \, k^m \, k^r \, 2^k$$

contributions to $N_{a_1, \ldots, a_j}$; using (2.75) and Lemma 2.3.22 we can bound this by

$$k^{O(k - 2j) + O(m)} \, n^{j - m + 1} \, 2^k.$$

Summing over the possible values of $m, r$ (recalling that we either have $j < k/2$ or $m > 0$, and also that $k = O(\log^2 n)$) we obtain

$$N_{a_1, \ldots, a_j} \le k^{O(k - 2j) + O(1)} \, n^{\max(j + 1, k/2)} \, 2^k.$$

The expression (2.73) can then be bounded by

$$2^k \sum_{j=1}^{k/2} O(n^{0.49})^{k - 2j} \, k^{O(k - 2j) + O(1)} \, n^{\max(j + 1, k/2)} \sum_{a_1, \ldots, a_j} 1.$$

When $j$ is exactly $k/2$, then all the $a_1, \ldots, a_j$ must equal 2, and so the contribution of this case simplifies to $2^k k^{O(1)} n^{k/2}$. For $j < k/2$, the numbers $a_1 - 2, \ldots, a_j - 2$ are non-negative and add up to $k - 2j$, and so the total number of possible values for these numbers (for fixed $j$) can be bounded crudely by $j^{k - 2j} \le k^{k - 2j}$ (for instance). Putting all this together, we can bound (2.73) by

$$2^k \Bigl[ k^{O(1)} n^{k/2} + \sum_{j=1}^{k/2 - 1} O(n^{0.49})^{k - 2j} \, k^{O(k - 2j) + O(1)} \, n^{j + 1} \, k^{k - 2j} \Bigr]$$

which simplifies by the geometric series formula (and the hypothesis $k = O(\log^2 n)$) to

$$O(2^k k^{O(1)} n^{k/2 + 0.98})$$

as required. □
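The exponent $0.98$ comes from simple bookkeeping in the sum over $j < k/2$. Writing $s := k - 2j$ (a notation introduced only for this remark), so that $s$ ranges over the even integers $2, 4, \ldots, k - 2$, each summand can be rewritten as follows.

```latex
\begin{align*}
O(n^{0.49})^{s} \, k^{O(s)+O(1)} \, n^{j+1} \, k^{s}
  &= k^{O(1)} \, n^{(k-s)/2 + 1} \, \bigl( k^{O(1)} n^{0.49} \bigr)^{s} \\
  &= k^{O(1)} \, n^{k/2+1} \, \bigl( k^{O(1)} n^{-0.01} \bigr)^{s} .
\end{align*}
```

Since $k = O(\log^2 n)$, the ratio $k^{O(1)} n^{-0.01}$ tends to zero, so the geometric series in $s$ is dominated by its first term $s = 2$, which is $k^{O(1)} n^{k/2 + 1 - 0.02} = k^{O(1)} n^{k/2 + 0.98}$; this also absorbs the $j = k/2$ contribution $k^{O(1)} n^{k/2}$.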

We can use this to conclude the following matching upper bound to Proposition 2.3.19, due to Bai and Yin [BaYi1988]:

Theorem 2.3.23 (Weak Bai-Yin theorem, upper bound). Let $M = (\xi_{ij})_{1 \le i, j \le n}$ be a real symmetric matrix whose entries all have the same distribution $\xi$, with mean zero, variance one, and fourth moment $O(1)$. Then for every $\varepsilon > 0$ independent of $n$, one has $\|M\|_{op} \le (2 + \varepsilon)\sqrt{n}$ asymptotically almost surely. In particular, $\|M\|_{op} \le (2 + o(1))\sqrt{n}$ asymptotically almost surely; as another consequence, the median of $\|M\|_{op}$ is at most $(2 + o(1))\sqrt{n}$. (If $\xi$ is bounded, we see in particular from Proposition 2.3.19 that the median is in fact equal to $(2 + o(1))\sqrt{n}$.)

The fourth moment hypothesis is best possible, as seen in the discussion after Theorem 2.3.8. We will discuss some generalisations and improvements of this theorem in other directions below.

Proof. To obtain Theorem 2.3.23 from Theorem 2.3.21 we use the truncation method. We split each $\xi_{ij}$ as $\xi_{ij, \le n^{0.49}} + \xi_{ij, > n^{0.49}}$ in the usual manner, and split $M = M_{\le n^{0.49}} + M_{> n^{0.49}}$ accordingly. We would like to apply Theorem 2.3.21 to $M_{\le n^{0.49}}$, but unfortunately the truncation causes some slight adjustment to the mean and variance of the $\xi_{ij, \le n^{0.49}}$. The variance is not much of a problem; since $\xi_{ij}$ had variance 1, it is clear that $\xi_{ij, \le n^{0.49}}$ has variance at most 1, and it is easy to see that reducing the variance only serves to improve the bound (2.72). As for the mean, we use the mean zero nature of $\xi_{ij}$ to write

$$\mathbf{E} \xi_{ij, \le n^{0.49}} = - \mathbf{E} \xi_{ij, > n^{0.49}}.$$

To control the right-hand side, we use the trivial inequality $|\xi_{ij, > n^{0.49}}| \le n^{-3 \times 0.49} |\xi_{ij}|^4$ and the bounded fourth moment hypothesis to conclude that

$$\mathbf{E} \xi_{ij, \le n^{0.49}} = O(n^{-1.47}).$$

Thus we can write $M_{\le n^{0.49}} = \tilde{M}_{\le n^{0.49}} + \mathbf{E} M_{\le n^{0.49}}$, where $\tilde{M}_{\le n^{0.49}}$ is the random matrix with coefficients

$$\tilde{\xi}_{ij, \le n^{0.49}} := \xi_{ij, \le n^{0.49}} - \mathbf{E} \xi_{ij, \le n^{0.49}}$$

and $\mathbf{E} M_{\le n^{0.49}}$ is a matrix whose entries have magnitude $O(n^{-1.47})$. In particular, by Schur's test this matrix has operator norm $O(n^{-0.47})$, and so by the triangle inequality

$$\|M\|_{op} \le \|\tilde{M}_{\le n^{0.49}}\|_{op} + \|M_{> n^{0.49}}\|_{op} + O(n^{-0.47}).$$

The error term $O(n^{-0.47})$ is clearly negligible for $n$ large, and it will suffice to show that

(2.76)  $$\|\tilde{M}_{\le n^{0.49}}\|_{op} \le (2 + \varepsilon/3)\sqrt{n}$$

and

(2.77)  $$\|M_{> n^{0.49}}\|_{op} \le \frac{\varepsilon}{3}\sqrt{n}$$

asymptotically almost surely.

We first show (2.76). We can now apply Theorem 2.3.21 to conclude that

$$\mathbf{E} \|\tilde{M}_{\le n^{0.49}}\|_{op}^k \le (2 + o(1))^k n^{k/2+1}$$

for any $k = O(\log^2 n)$. In particular, we see from Markov's inequality (1.13) that (2.76) fails with probability at most

$$\left( \frac{2 + o(1)}{2 + \varepsilon/3} \right)^k n.$$

Setting $k$ to be a large enough multiple of $\log n$ (depending on $\varepsilon$), we thus see that this event (2.76) indeed holds asymptotically almost surely$^{23}$.

$^{23}$ Indeed, one can ensure it happens with overwhelming probability, by letting $k/\log n$ grow slowly to infinity.

Now we turn to (2.77). The idea here is to exploit the sparseness of the matrix $M_{> n^{0.49}}$. First let us dispose of the event that one of the entries $\xi_{ij}$ has magnitude larger than $\frac{\varepsilon}{3}\sqrt{n}$ (which would certainly cause (2.77) to fail). By the union bound, the probability of this event is at most

$$n^2 \, \mathbf{P}\Bigl( |\xi| \ge \frac{\varepsilon}{3}\sqrt{n} \Bigr).$$

By the fourth moment bound on $\xi$ and dominated convergence, this expression goes to zero as $n \to \infty$. Thus, asymptotically almost surely, all entries are less than $\frac{\varepsilon}{3}\sqrt{n}$.

Now let us see how many non-zero entries there are in $M_{> n^{0.49}}$. By Markov's inequality (1.13) and the fourth moment hypothesis, each entry has a probability $O(n^{-4 \times 0.49}) = O(n^{-1.96})$ of being non-zero; by the first moment method, we see that the expected number of entries is $O(n^{0.04})$. As this is much less than $n$, we expect it to be unlikely that any row or column has more than one entry. Indeed, from the union bound and independence, we see that the probability that any given row and column has at least two non-zero entries is at most

$$n^2 \times O(n^{-1.96})^2 = O(n^{-1.92})$$

and so by the union bound again, we see that with probability at least $1 - O(n^{-0.92})$ (and in particular, asymptotically almost surely), none of the rows or columns have more than one non-zero entry. As the entries have magnitude at most $\frac{\varepsilon}{3}\sqrt{n}$, the bound (2.77) now follows from Schur's test, and the claim follows. □

We can upgrade the asymptotic almost sure bound to almost sure boundedness:

Theorem 2.3.24 (Strong Bai-Yin theorem, upper bound). Let $\xi$ be a real random variable with mean zero, variance 1, and finite fourth moment, and for all $1 \le i \le j$, let $\xi_{ij}$ be an iid sequence with distribution $\xi$, and set $\xi_{ji} := \xi_{ij}$. Let $M_n := (\xi_{ij})_{1 \le i, j \le n}$ be the random matrix formed by the top left $n \times n$ block. Then almost surely one has $\limsup_{n \to \infty} \|M_n\|_{op}/\sqrt{n} \le 2$.

Exercise 2.3.14. By combining the above results with Proposition 2.3.19 and Exercise 2.3.5, show that with the hypotheses of Theorem 2.3.24 with $\xi$ bounded, one has $\lim_{n \to \infty} \|M_n\|_{op}/\sqrt{n} = 2$ almost surely$^{24}$.

Proof. We first give ourselves an cpsilon of room (cf. [Ta2010, 
§2.7]). It suffices to show that for each e > 0, one has 

(2.78) limsup||M„|| op / x /n< 2 + s 

n— >oo 

almost surely. 

Next, we perform dyadic sparsification (as was done in the proof 
of the strong law of large numbers, Theorem 2.1.8). Observe that 
any minor of a matrix has its operator norm bounded by that of the 
larger matrix; and so ||M„|| op is increasing in n. Because of this, it 
will suffice to show (2.78) almost surely for n restricted to a lacunary 
sequence, such as n — n m := [(1 + e) m \ for m = 1,2,..., as the 
general case then follows by rounding n upwards to the nearest n m 
(and adjusting e a little bit as necessary). 

Once we have sparsified, it is now safe to apply the Borel-Cantelli
lemma (Exercise 1.1.1), and it will suffice to show that

$\sum_{m=1}^\infty P(\|M_{n_m}\|_{op} > (2+\varepsilon)\sqrt{n_m}) < \infty$.

^{24} The same claim is true without the boundedness hypothesis; we will
see this in Section 2.4.

To bound the probabilities $P(\|M_{n_m}\|_{op} > (2+\varepsilon)\sqrt{n_m})$, we inspect the
proof of Theorem 2.3.23. Most of the contributions to this probability
decay polynomially in $n_m$ (i.e. are of the form $O(n_m^{-c})$ for some
$c > 0$) and so are summable. The only contribution which can cause
difficulty is the contribution of the event that one of the entries of
$M_{n_m}$ exceeds $\varepsilon\sqrt{n_m}$ in magnitude; this event was bounded by

$n_m^2 P(|\xi| > \varepsilon\sqrt{n_m})$.

But if one sums over $m$ using Fubini's theorem and the geometric
series formula, we see that this expression is bounded by $O_\varepsilon(E|\xi|^4)$,
which is finite by hypothesis, and the claim follows. □
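The final summation step of the proof can be spelled out as follows. This is a sketch only; the constant $C_\varepsilon$ is a name introduced here for illustration and depends only on $\varepsilon$.

```latex
\sum_{m=1}^\infty n_m^2 \, P\big(|\xi| > \varepsilon\sqrt{n_m}\big)
  = E \sum_{m \,:\, n_m < |\xi|^2/\varepsilon^2} n_m^2
  \le C_\varepsilon \, E\,\frac{|\xi|^4}{\varepsilon^4}
  = O_\varepsilon\big(E|\xi|^4\big).
```

The first identity is Fubini's theorem (all terms are non-negative). For the inequality, note that $n_m \le (1+\varepsilon)^m < n_m + 1 \le 2 n_m$, so the inner sum is dominated by a geometric series with ratio $(1+\varepsilon)^2$ whose largest term is less than $4(|\xi|^2/\varepsilon^2)^2$; such a series is comparable to its largest term.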

Now we discuss some variants and generalisations of the Bai-Yin 
result. 

Firstly, we note that the results stated above require the diagonal
and off-diagonal terms to have the same distribution. This is not
the case for important ensembles such as the Gaussian Orthogonal
Ensemble (GOE), in which the diagonal entries have twice as much
variance as the off-diagonal ones. But this can easily be handled by
considering the diagonal separately. For instance, consider a diagonal
matrix $D = \mathrm{diag}(\xi_{11}, \dots, \xi_{nn})$ where the $\xi_{ii} \equiv \xi$ are identically
distributed with finite second moment. The operator norm of this
matrix is just $\sup_{1 \le i \le n} |\xi_{ii}|$, and so by the union bound

$P(\|D\|_{op} > \varepsilon\sqrt{n}) \le n P(|\xi| > \varepsilon\sqrt{n})$.

From the finite second moment and dominated convergence, the right-hand
side is $o_\varepsilon(1)$, and so we conclude that for every fixed $\varepsilon > 0$,
$\|D\|_{op} \le \varepsilon\sqrt{n}$ asymptotically almost surely; diagonalising, we conclude
that $\|D\|_{op} = o(\sqrt{n})$ asymptotically almost surely. Because
of this and the triangle inequality, we can modify the diagonal by
any amount with identical distribution and bounded second moment
(a similar argument also works for non-identical distributions if one
has uniform control of some moment beyond the second, such as the
fourth moment) while only affecting all operator norms by $o(\sqrt{n})$.
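For the reader's convenience, the dominated convergence step used above can be written out in one line (a sketch; $1_E$ denotes the indicator of an event $E$):

```latex
n \, P\big(|\xi| > \varepsilon\sqrt{n}\big)
  = E\big[\, n \, 1_{|\xi| > \varepsilon\sqrt{n}} \,\big]
  \le \frac{1}{\varepsilon^2} \, E\big[\, |\xi|^2 \, 1_{|\xi| > \varepsilon\sqrt{n}} \,\big]
  \longrightarrow 0 \qquad (n \to \infty).
```

The inequality holds because $n < |\xi|^2/\varepsilon^2$ on the event in question, and the limit follows since $|\xi|^2 1_{|\xi| > \varepsilon\sqrt{n}}$ is dominated by the integrable variable $|\xi|^2$ and tends to zero pointwise.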

Exercise 2.3.15. Modify this observation to extend the weak and
strong Bai-Yin theorems to the case where the diagonal entries are
allowed to have different distribution than the off-diagonal terms, and
need not be independent of each other or of the off-diagonal terms,
but have uniformly bounded fourth moment.

Secondly, it is a routine matter to generalise the Bai-Yin result 
from real symmetric matrices to Hermitian matrices, basically for the 
same reasons that Exercise 2.3.13 works. We leave the details to the 
interested reader. 

The Bai-Yin results also hold for iid random matrices, where
$\xi_{ij} \equiv \xi$ has mean zero, unit variance, and bounded fourth moment;
this is a result of Yin, Bai, and Krishnaiah [YiBaKr1988]. Because
of the lack of symmetry, the eigenvalues need not be real, and the
bounds (2.66) no longer apply. However, there is a substitute, namely
the bound

(2.79) $\|M\|_{op}^k \le \mathrm{tr}((MM^*)^{k/2}) \le n \|M\|_{op}^k$,

valid for any $n \times n$ matrix $M$ with complex entries and every even
positive integer $k$.

Exercise 2.3.16. Prove (2.79).

It is possible to adapt all of the above moment calculations for
$\mathrm{tr}(M^k)$ in the symmetric or Hermitian cases to give analogous results
for $\mathrm{tr}((MM^*)^{k/2})$ in the non-symmetric cases; we do not give the
details here, but mention that the cycles now go back and forth along
a bipartite graph with $n$ vertices in each class, rather than in the
complete graph on $n$ vertices, although this ends up not to affect the
enumerative combinatorics significantly. Another way of viewing this
is through the simple observation that the operator norm of a
non-symmetric matrix $M$ is equal to the operator norm of the augmented
matrix

(2.80) $\tilde M := \begin{pmatrix} 0 & M \\ M^* & 0 \end{pmatrix}$,

which is a $2n \times 2n$ Hermitian matrix. Thus one can to some extent
identify an $n \times n$ iid matrix $M$ with a $2n \times 2n$ Wigner-type matrix
$\tilde M$, in which two $n \times n$ blocks of that matrix are set to zero.

Exercise 2.3.17. If $M$ has singular values $\sigma_1, \dots, \sigma_n$, show that $\tilde M$
has eigenvalues $\pm\sigma_1, \dots, \pm\sigma_n$. This suggests that the theory of the
singular values of an iid matrix should resemble to some extent the
theory of eigenvalues of a Wigner matrix; we will see several examples
of this phenomenon in later sections.
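Returning to the trace bound (2.79): rearranged, it brackets the operator norm as $(\mathrm{tr}((MM^*)^{k/2})/n)^{1/k} \le \|M\|_{op} \le (\mathrm{tr}((MM^*)^{k/2}))^{1/k}$, and the two sides differ only by the factor $n^{1/k}$. The sketch below illustrates this for one small real matrix, chosen here so that its singular values $\sqrt{45}$ and $\sqrt{5}$ are known exactly. The helper names `matmul` and `norm_bracket` are hypothetical, and this is a numerical illustration rather than a proof of Exercise 2.3.16.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

def norm_bracket(M, k):
    """Bounds (lower, upper) on ||M||_op from tr((M M^T)^(k/2)), real square M, even k >= 2."""
    n = len(M)
    Mt = [list(col) for col in zip(*M)]
    G = matmul(M, Mt)                  # M M^T, which is M M^* for real M
    P = G
    for _ in range(k // 2 - 1):        # after the loop, P = (M M^T)^(k/2)
        P = matmul(P, G)
    trace = sum(P[i][i] for i in range(n))
    return (trace / n) ** (1.0 / k), trace ** (1.0 / k)

M = [[3.0, 0.0], [4.0, 5.0]]           # M M^T = [[9, 12], [12, 41]] has eigenvalues 45 and 5
for k in (2, 4, 8, 16):
    print(k, norm_bracket(M, k))       # both bounds approach sqrt(45) as k grows
```

This is why the moment method takes $k$ to grow with $n$: the bracket only becomes tight once $n^{1/k}$ is close to 1.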

When one assumes more moment conditions on $\xi$ than bounded
fourth moment, one can obtain substantially more precise asymptotics
on $\mathrm{tr}(M^k)$ than given by results such as Theorem 2.3.21, particularly
if one also assumes that the underlying random variable $\xi$ is symmetric
(i.e. $\xi \equiv -\xi$). At a practical level, the advantage of symmetry is
that it allows one to assume that the high-multiplicity edges in a cycle
are traversed an even number of times; see the following exercise.

Exercise 2.3.18. Let $X$ be a bounded real random variable. Show
that $X$ is symmetric if and only if $EX^k = 0$ for all positive odd
integers $k$.

Next, extend the previous result to the case when $X$ is subgaussian
rather than bounded. (Hint: The slickest way to do this is via
the characteristic function $e^{itX}$ and analytic continuation; it is also
instructive to find a "real-variable" proof that avoids the use of this
function.)

By using these methods, it is in fact possible to show that under
various hypotheses, $\|M\|_{op}$ is concentrated in the range
$[2\sqrt{n} - O(n^{-1/6}), 2\sqrt{n} + O(n^{-1/6})]$, and even to get a universal
distribution for the normalised expression $(\|M\|_{op} - 2\sqrt{n}) n^{1/6}$, known as
the Tracy-Widom law. See [So1999] for details. There have also
been a number of subsequent variants and refinements of this result
(as well as counterexamples when not enough moment hypotheses are
assumed); see^{25} [So2004, SoFy2005, Ru2007, Pe2006, Vu2007,
PeSo2007, Pe2009, Kh2009, TaVu2009c].

^{25} Similar results for some non-independent distributions are also
available, see e.g. the paper [DeGi2007], which (like many of the other
references cited above) builds upon the original work of Tracy and
Widom [TrWi2002] that handled special ensembles such as GOE and GUE.

2.4. The semicircular law

We can now turn attention to one of the centerpiece universality
results in random matrix theory, namely the Wigner semicircle law for
Wigner matrices. Recall from Section 2.3 that a Wigner Hermitian
matrix ensemble is a random matrix ensemble $M_n = (\xi_{ij})_{1 \le i,j \le n}$ of
Hermitian matrices (thus $\xi_{ij} = \overline{\xi_{ji}}$; this includes real symmetric
matrices as an important special case), in which the upper-triangular
entries $\xi_{ij}$, $i > j$ are iid complex random variables with mean zero
and unit variance, and the diagonal entries $\xi_{ii}$ are iid real variables,
independent of the upper-triangular entries, with bounded mean and
variance. Particular special cases of interest include the Gaussian
Orthogonal Ensemble (GOE), the symmetric random sign matrices (aka
symmetric Bernoulli ensemble), and the Gaussian Unitary Ensemble
(GUE).

In Section 2.3 we saw that the operator norm of $M_n$ was typically
of size $O(\sqrt{n})$, so it is natural to work with the normalised
matrix $\frac{1}{\sqrt{n}} M_n$. Accordingly, given any $n \times n$ Hermitian matrix $M_n$,
we can form the (normalised) empirical spectral distribution (or ESD
for short)

$\mu_{\frac{1}{\sqrt{n}} M_n} := \frac{1}{n} \sum_{j=1}^n \delta_{\lambda_j(M_n)/\sqrt{n}}$

of $M_n$, where $\lambda_1(M_n) \le \dots \le \lambda_n(M_n)$ are the (necessarily real)
eigenvalues of $M_n$, counting multiplicity. The ESD is a probability
measure, which can be viewed as a distribution of the normalised
eigenvalues of $M_n$.

When $M_n$ is a random matrix ensemble, then the ESD $\mu_{\frac{1}{\sqrt{n}} M_n}$
is now a random measure - i.e. a random variable^{26} taking values in
the space $\Pr(\mathbf{R})$ of probability measures on the real line.

^{26} Thus, the distribution of $\mu_{\frac{1}{\sqrt{n}} M_n}$ is a probability measure on
probability measures!

Now we consider the behaviour of the ESD of a sequence of Hermitian
matrix ensembles $M_n$ as $n \to \infty$. Recall from Section 1.1
that for any sequence of random variables in a $\sigma$-compact metrisable
space, one can define notions of convergence in probability and
convergence almost surely. Specialising these definitions to the case of
random probability measures on $\mathbf{R}$, and to deterministic limits, we
see that a sequence of random ESDs $\mu_{\frac{1}{\sqrt{n}} M_n}$ converge in probability
(resp. converge almost surely) to a deterministic limit $\mu \in \Pr(\mathbf{R})$
(which, confusingly enough, is a deterministic probability measure!)
if, for every test function $\varphi \in C_c(\mathbf{R})$, the quantities $\int_{\mathbf{R}} \varphi \, d\mu_{\frac{1}{\sqrt{n}} M_n}$
converge in probability (resp. converge almost surely) to $\int_{\mathbf{R}} \varphi \, d\mu$.

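Remark 2.4.1. As usual, convergence almost surely implies convergence
in probability, but not vice versa. In the special case of random
probability measures, there is an even weaker notion of convergence,
namely convergence in expectation, defined as follows. Given a random
ESD $\mu_{\frac{1}{\sqrt{n}} M_n}$, one can form its expectation $E\mu_{\frac{1}{\sqrt{n}} M_n} \in \Pr(\mathbf{R})$,
defined via duality (the Riesz representation theorem) as

$\int_{\mathbf{R}} \varphi \, dE\mu_{\frac{1}{\sqrt{n}} M_n} := E \int_{\mathbf{R}} \varphi \, d\mu_{\frac{1}{\sqrt{n}} M_n}$;

this probability measure can be viewed as the law of a random eigenvalue
$\frac{1}{\sqrt{n}} \lambda_i(M_n)$ drawn from a random matrix $M_n$ from the ensemble.
We then say that the ESDs converge in expectation to a limit
$\mu \in \Pr(\mathbf{R})$ if $E\mu_{\frac{1}{\sqrt{n}} M_n}$ converges in the vague topology to $\mu$, thus

$\lim_{n \to \infty} E \int_{\mathbf{R}} \varphi \, d\mu_{\frac{1}{\sqrt{n}} M_n} = \int_{\mathbf{R}} \varphi \, d\mu$

for all $\varphi \in C_c(\mathbf{R})$.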

In general, these notions of convergence are distinct from each 
other; but in practice, one often finds in random matrix theory that 
these notions are effectively equivalent to each other, thanks to the 
concentration of measure phenomenon. 

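Exercise 2.4.1. Let $M_n$ be a sequence of $n \times n$ Hermitian matrix
ensembles, and let $\mu$ be a continuous probability measure on $\mathbf{R}$.

(i) Show that $\mu_{\frac{1}{\sqrt{n}} M_n}$ converges almost surely to $\mu$ if and only
if $\mu_{\frac{1}{\sqrt{n}} M_n}(-\infty, \lambda)$ converges almost surely to $\mu(-\infty, \lambda)$ for all
$\lambda \in \mathbf{R}$.

(ii) Show that $\mu_{\frac{1}{\sqrt{n}} M_n}$ converges in probability to $\mu$ if and only
if $\mu_{\frac{1}{\sqrt{n}} M_n}(-\infty, \lambda)$ converges in probability to $\mu(-\infty, \lambda)$ for all
$\lambda \in \mathbf{R}$.

(iii) Show that $\mu_{\frac{1}{\sqrt{n}} M_n}$ converges in expectation to $\mu$ if and only
if $E\mu_{\frac{1}{\sqrt{n}} M_n}(-\infty, \lambda)$ converges to $\mu(-\infty, \lambda)$ for all $\lambda \in \mathbf{R}$.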






We can now state the Wigner semicircular law. 

Theorem 2.4.2 (Semicircular law). Let $M_n$ be the top left $n \times n$
minors of an infinite Wigner matrix $(\xi_{ij})_{i,j \ge 1}$. Then the ESDs $\mu_{\frac{1}{\sqrt{n}} M_n}$
converge almost surely (and hence also in probability and in expectation)
to the Wigner semicircular distribution

(2.81) $\mu_{sc} := \frac{1}{2\pi} (4 - |x|^2)_+^{1/2} \, dx$.

The semicircular law nicely complements the upper Bai-Yin theorem
(Theorem 2.3.24), which asserts that (in the case when the entries
have finite fourth moment, at least), the matrix $\frac{1}{\sqrt{n}} M_n$ almost surely
has operator norm at most $2 + o(1)$. Note that the operator norm is
the same thing as the largest magnitude of the eigenvalues. Because
the semicircular distribution (2.81) is supported on the interval $[-2, 2]$
with positive density on the interior of this interval, Theorem 2.4.2
easily supplies the lower Bai-Yin theorem, that the operator norm of
$\frac{1}{\sqrt{n}} M_n$ is almost surely at least $2 - o(1)$, and thus (in the finite fourth
moment case) the norm is in fact equal to $2 + o(1)$. Indeed, we have
just shown that the semicircular law provides an alternate proof of the
lower Bai-Yin bound (Proposition 2.3.19).

As will become clearer in Section 2.5, the semicircular law is
the noncommutative (or free probability) analogue of the central limit
theorem, with the semicircular distribution (2.81) taking on the role
of the normal distribution. Of course, there is a striking difference
between the two distributions, in that the former is compactly supported
while the latter is merely subgaussian. One reason for this
is that the concentration of measure phenomenon is more powerful
in the case of ESDs of Wigner matrices than it is for averages of iid
variables; compare the concentration of measure results in Section 2.3
with those in Section 2.1.

There are several ways to prove (or at least to heuristically justify)
the semicircular law. In this section we shall focus on the two most
popular methods, the moment method and the Stieltjes transform
method, together with a third (heuristic) method based on Dyson
Brownian motion (see Section 3.1). In Section 2.5 we shall study the
free probability approach, and in Section 2.6 we will study the
determinantal processes method (although this method is
initially only restricted to highly symmetric ensembles, such as GUE).

2.4.1. Preliminary reductions. Before we begin any of the proofs
of the semicircular law, we make some simple observations which will
reduce the difficulty of the arguments in the sequel.

The first observation is that the Cauchy interlacing law (Exercise
1.3.14) shows that the ESD of $\frac{1}{\sqrt{n}} M_n$ is very stable in $n$. Indeed, we
see from the interlacing law that

$\frac{n}{m} \mu_{\frac{1}{\sqrt{n}} M_n}(-\infty, \lambda/\sqrt{n}) - \frac{n-m}{m} \le \mu_{\frac{1}{\sqrt{m}} M_m}(-\infty, \lambda/\sqrt{m}) \le \frac{n}{m} \mu_{\frac{1}{\sqrt{n}} M_n}(-\infty, \lambda/\sqrt{n})$

for any threshold $\lambda$ and any $n > m > 0$.

Exercise 2.4.2. Using this observation, show that to establish the
semicircular law (in any of the three senses of convergence), it suffices
to do so for an arbitrary lacunary sequence $n_1, n_2, \dots$ of $n$ (thus
$n_{j+1}/n_j \ge c$ for some $c > 1$ and all $j$).

The above lacunary reduction does not help one establish convergence
in probability or expectation, but will be useful^{27} when establishing
almost sure convergence, as it significantly reduces the inefficiency
of the union bound.

^{27} Note that a similar lacunary reduction was also used to prove the
strong law of large numbers, Theorem 2.1.8.

Next, we exploit the stability of the ESD with respect to perturbations,
by taking advantage of the Wielandt-Hoffman inequality

(2.82) $\sum_{j=1}^n |\lambda_j(A+B) - \lambda_j(A)|^2 \le \|B\|_F^2$,

for Hermitian matrices $A, B$, where $\|B\|_F := (\mathrm{tr} B^2)^{1/2}$ is the Frobenius
norm (2.64) of $B$; see Exercise 1.3.6 or Exercise 1.3.4. We convert
this inequality into an inequality about ESDs:

Lemma 2.4.3. For any $n \times n$ Hermitian matrices $A, B$, any $\lambda$, and
any $\varepsilon > 0$, we have

$\mu_{\frac{1}{\sqrt{n}}(A+B)}(-\infty, \lambda) \le \mu_{\frac{1}{\sqrt{n}} A}(-\infty, \lambda + \varepsilon) + \frac{1}{\varepsilon^2 n^2} \|B\|_F^2$

and similarly

$\mu_{\frac{1}{\sqrt{n}}(A+B)}(-\infty, \lambda) \ge \mu_{\frac{1}{\sqrt{n}} A}(-\infty, \lambda - \varepsilon) - \frac{1}{\varepsilon^2 n^2} \|B\|_F^2$.

Proof. We just prove the first inequality, as the second is similar
(and also follows from the first, by reversing the sign of $A, B$).

Let $\lambda_i(A+B)$ be the largest eigenvalue of $A+B$ less than $\lambda\sqrt{n}$,
and let $\lambda_j(A)$ be the largest eigenvalue of $A$ less than $(\lambda + \varepsilon)\sqrt{n}$. Our
task is to show that

$i \le j + \frac{1}{\varepsilon^2 n} \|B\|_F^2$.

If $i \le j$ then we are clearly done, so suppose that $i > j$. Then we
have $|\lambda_l(A+B) - \lambda_l(A)| \ge \varepsilon\sqrt{n}$ for all $j < l \le i$, and hence

$\sum_{l=1}^n |\lambda_l(A+B) - \lambda_l(A)|^2 \ge \varepsilon^2 (i - j) n$.

The claim now follows from (2.82). □
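The inequality (2.82) is easy to test by hand in the smallest nontrivial case. The sketch below is an illustration only (not the proof, for which see the exercises cited above); it uses the closed-form eigenvalues of a real symmetric $2 \times 2$ matrix, and the function names `eig2` and `wielandt_hoffman_sides` are hypothetical. A symmetric matrix $[[a, b], [b, c]]$ is stored as the tuple `(a, b, c)`.

```python
import math

def eig2(a, b, c):
    """Eigenvalues, in ascending order, of the real symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    return (mean - rad, mean + rad)

def wielandt_hoffman_sides(A, B):
    """Return (lhs, rhs) of (2.82) for symmetric 2 x 2 matrices given as tuples (a, b, c)."""
    S = tuple(x + y for x, y in zip(A, B))          # entries of A + B
    lhs = sum((s - t) ** 2 for s, t in zip(eig2(*S), eig2(*A)))
    rhs = B[0] ** 2 + 2.0 * B[1] ** 2 + B[2] ** 2   # tr(B^2) = squared Frobenius norm
    return lhs, rhs

print(wielandt_hoffman_sides((1.0, 2.0, -1.0), (0.5, -1.0, 0.25)))
print(wielandt_hoffman_sides((1.0, 2.0, -1.0), (0.3, 0.0, 0.3)))   # B a multiple of I: equality
```

The second call shows that (2.82) can hold with equality, for instance when $B$ is a multiple of the identity.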

This has the following corollary: 

Exercise 2.4.3 (Stability of ESD laws wrt small perturbations). Let
$M_n$ be a sequence of random Hermitian matrix ensembles such that
$\mu_{\frac{1}{\sqrt{n}} M_n}$ converges almost surely to a limit $\mu$. Let $N_n$ be another
sequence of Hermitian random matrix ensembles such that $\frac{1}{n^2} \|N_n\|_F^2$
converges almost surely to zero. Show that $\mu_{\frac{1}{\sqrt{n}}(M_n + N_n)}$ converges
almost surely to $\mu$.

Show that the same claim holds if "almost surely" is replaced by
"in probability" or "in expectation" throughout.

Informally, this exercise allows us to discard any portion of the
matrix which is $o(n)$ in the Frobenius norm (2.64). For instance, the
diagonal entries of $M_n$ have a Frobenius norm of $O(\sqrt{n})$ almost surely,
by the strong law of large numbers (Theorem 2.1.8). Hence, without
loss of generality, we may set the diagonal equal to zero for the
purposes of the semicircular law.

One can also remove any component of $M_n$ that is of rank $o(n)$:

Exercise 2.4.4 (Stability of ESD laws wrt small rank perturbations).
Let $M_n$ be a sequence of random Hermitian matrix ensembles such
that $\mu_{\frac{1}{\sqrt{n}} M_n}$ converges almost surely to a limit $\mu$. Let $N_n$ be another
sequence of random matrix ensembles such that $\frac{1}{n} \mathrm{rank}(N_n)$
converges almost surely to zero. Show that $\mu_{\frac{1}{\sqrt{n}}(M_n + N_n)}$ converges
almost surely to $\mu$. (Hint: use the Weyl inequalities instead of the
Wielandt-Hoffman inequality.)

Show that the same claim holds if "almost surely" is replaced by
"in probability" or "in expectation" throughout.

In a similar vein, we may apply the truncation argument (much 
as was done for the central limit theorem in Section 2.2) to reduce 
the semicircular law to the bounded case: 

Exercise 2.4.5. Show that in order to prove the semicircular law 
(in the almost sure sense), it suffices to do so under the additional 
hypothesis that the random variables are bounded. Similarly for the 
convergence in probability or in expectation senses. 

Remark 2.4.4. These facts ultimately rely on the stability of eigenvalues
with respect to perturbations. This stability is automatic in the
Hermitian case, but for non-symmetric matrices, serious instabilities
can occur due to the presence of pseudospectrum. We will discuss this
phenomenon more in later sections (but see also [Ta2009b, §1.5]).

2.4.2. The moment method. We now prove the semicircular law 
via the method of moments, which we have already used several times 
in the previous sections. In order to use this method, it is convenient 
to use the preceding reductions to assume that the coefficients are 
bounded, the diagonal vanishes, and that n ranges over a lacunary 
sequence. We will implicitly assume these hypotheses throughout the 
rest of the section. 

As we have already discussed the moment method extensively, 
much of the argument here will be delegated to exercises. A full 
treatment of these computations can be found in [BaSi2010]. 

The basic starting point is the observation that the moments of
the ESD $\mu_{\frac{1}{\sqrt{n}} M_n}$ can be written as normalised traces of powers of $M_n$:

(2.83) $\int_{\mathbf{R}} x^k \, d\mu_{\frac{1}{\sqrt{n}} M_n}(x) = \frac{1}{n} \mathrm{tr}(\frac{1}{\sqrt{n}} M_n)^k$.

In particular, on taking expectations, we have

$\int_{\mathbf{R}} x^k \, dE\mu_{\frac{1}{\sqrt{n}} M_n}(x) = E \frac{1}{n} \mathrm{tr}(\frac{1}{\sqrt{n}} M_n)^k$.

From concentration of measure for the operator norm of a random
matrix (Proposition 2.3.10), we see that the $E\mu_{\frac{1}{\sqrt{n}} M_n}$ are uniformly
subgaussian, indeed we have

$E\mu_{\frac{1}{\sqrt{n}} M_n}\{|x| \ge \lambda\} \le C e^{-c \lambda^2 n^2}$

for $\lambda > C$, where $C, c$ are absolute (so the decay in fact improves
quite rapidly with $n$). From this and the Carleman continuity theorem
(Theorem 2.2.9), we can now establish the semicircular law through
computing the mean and variance of moments:

Exercise 2.4.6.

(i) Show that to prove convergence in expectation to the semicircular
law, it suffices to show that

(2.84) $E \frac{1}{n} \mathrm{tr}(\frac{1}{\sqrt{n}} M_n)^k = \int_{\mathbf{R}} x^k \, d\mu_{sc}(x) + o_k(1)$

for $k = 1, 2, \dots$, where $o_k(1)$ is an expression that goes to
zero as $n \to \infty$ for fixed $k$ (and fixed choice of coefficient
distribution $\xi$).

(ii) Show that to prove convergence in probability to the semicircular
law, it suffices to show (2.84) together with the variance bound

(2.85) $\mathrm{Var}(\frac{1}{n} \mathrm{tr}(\frac{1}{\sqrt{n}} M_n)^k) = o_k(1)$

for $k = 1, 2, \dots$.

(iii) Show that to prove almost sure convergence to the semicircular
law, it suffices to show (2.84) together with the variance bound

(2.86) $\mathrm{Var}(\frac{1}{n} \mathrm{tr}(\frac{1}{\sqrt{n}} M_n)^k) = O_k(n^{-c_k})$

for $k = 1, 2, \dots$ and some $c_k > 0$. (Note here that it is useful
to restrict $n$ to a lacunary sequence!)



Ordinarily, computing second-moment quantities such as the left-hand
side of (2.85) is harder than computing first-moment quantities
such as (2.84). But one can obtain the required variance bounds from
concentration of measure:

Exercise 2.4.7.

(i) When $k$ is a positive even integer, use Talagrand's inequality
(Theorem 2.1.13) and convexity of the Schatten norm
$\|A\|_{S^k} = (\mathrm{tr}(A^k))^{1/k}$ to establish (2.86) (and hence (2.85)).

(ii) For $k$ odd, the formula $\|A\|_{S^k} = (\mathrm{tr}(A^k))^{1/k}$ still applies as
long as $A$ is positive definite. Applying this observation,
the Bai-Yin theorem, and Talagrand's inequality to the $S^k$
norms of $\frac{1}{\sqrt{n}} M_n + c I_n$ for a constant $c > 2$, establish (2.86)
(and hence (2.85)) when $k$ is odd also.

Remark 2.4.5. More generally, concentration of measure results 
(such as Talagrand's inequality, Theorem 2.1.13) can often be used 
to automatically upgrade convergence in expectation to convergence 
in probability or almost sure convergence. We will not attempt to 
formalise this principle here. 

It is not difficult to establish (2.86), (2.85) through the moment
method as well. Indeed, recall from Theorem 2.3.16 that we have
the expected moment

(2.87) $E \frac{1}{n} \mathrm{tr}(\frac{1}{\sqrt{n}} M_n)^k = C_{k/2} + o_k(1)$

for all $k = 1, 2, \dots$, where the Catalan number $C_{k/2}$ is zero when $k$ is
odd, and is equal to

(2.88) $C_{k/2} := \frac{k!}{(k/2+1)! (k/2)!}$

for $k$ even.
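The asymptotic (2.87) can be observed even for modest $n$. The following is a rough numerical sketch under stated assumptions, not a verification of the theorem: it draws a single $60 \times 60$ symmetric random sign matrix with vanishing diagonal (in line with the reductions of Section 2.4.1) from a fixed seed, and compares its normalised trace moments with the values from (2.88). The names `catalan_moment`, `sign_wigner` and `normalised_trace_moment` are hypothetical.

```python
import math
import random

def catalan_moment(k):
    """C_{k/2} = k! / ((k/2 + 1)! (k/2)!) for even k, and 0 for odd k, as in (2.88)."""
    if k % 2:
        return 0
    h = k // 2
    return math.factorial(k) // (math.factorial(h + 1) * math.factorial(h))

def sign_wigner(n, rng):
    """Symmetric n x n matrix with iid +-1 entries off the diagonal and zero diagonal."""
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            M[i][j] = M[j][i] = 1 if rng.random() < 0.5 else -1
    return M

def normalised_trace_moment(M, k):
    """(1/n) tr((M / sqrt(n))^k), using exact integer matrix powers."""
    n = len(M)
    P = M
    for _ in range(k - 1):             # after the loop, P = M^k
        P = [[sum(P[i][l] * M[l][j] for l in range(n)) for j in range(n)]
             for i in range(n)]
    return sum(P[i][i] for i in range(n)) / n ** (1 + k / 2)

rng = random.Random(0)                 # fixed seed, for reproducibility only
M = sign_wigner(60, rng)
for k in (1, 2, 3, 4):
    print(k, normalised_trace_moment(M, k), catalan_moment(k))
```

For this ensemble the second moment equals $(n-1)/n$ exactly, and the fourth is close to $C_2 = 2$ up to an $O(1/n)$ correction and a small random fluctuation.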

Exercise 2.4.8. By modifying the proof of Theorem 2.3.16, show
that

(2.89) $E |\frac{1}{n} \mathrm{tr}(\frac{1}{\sqrt{n}} M_n)^k|^2 = C_{k/2}^2 + o_k(1)$

and deduce (2.85). By refining the error analysis (e.g. using Theorem
2.3.21), also establish (2.86).

In view of the above computations, the establishment of the semicircular
law now reduces to computing the moments of the semicircular
distribution:

Exercise 2.4.9. Show that for any $k = 1, 2, 3, \dots$, one has

$\int_{\mathbf{R}} x^k \, d\mu_{sc}(x) = C_{k/2}$.

(Hint: use a trigonometric substitution $x = 2\cos\theta$, and then express
the integrand in terms of Fourier phases $e^{in\theta}$.)
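Before attempting Exercise 2.4.9, one can sanity-check the claimed identity numerically. The sketch below is a numerical check only, not the requested proof; the function names and the midpoint-rule discretisation are choices made here for illustration. It integrates $x^k$ against the density in (2.81) and compares with (2.88).

```python
import math

def semicircle_density(x):
    """Density (1 / (2 pi)) * sqrt(4 - x^2) on [-2, 2], zero elsewhere, as in (2.81)."""
    return math.sqrt(max(4.0 - x * x, 0.0)) / (2.0 * math.pi)

def semicircle_moment(k, steps=100_000):
    """Midpoint-rule approximation of the k-th moment of the semicircular distribution."""
    h = 4.0 / steps
    total = 0.0
    for i in range(steps):
        x = -2.0 + (i + 0.5) * h
        total += x ** k * semicircle_density(x)
    return total * h

def catalan_moment(k):
    """C_{k/2} for even k and 0 for odd k, following (2.88)."""
    if k % 2:
        return 0
    h = k // 2
    return math.factorial(k) // (math.factorial(h + 1) * math.factorial(h))

for k in range(1, 7):
    print(k, round(semicircle_moment(k), 6), catalan_moment(k))
```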

This concludes the proof of the semicircular law (for any of the 
three modes of convergence). 

Remark 2.4.6. In the spirit of the Lindeberg exchange method, observe
that Exercise 2.4.9 is unnecessary if one already knows that
the semicircular law holds for at least one ensemble of Wigner matrices
(e.g. the GUE ensemble). Indeed, Exercise 2.4.9 can be deduced
from such a piece of knowledge. In such a situation, it is not necessary
to actually compute the main term $C_{k/2}$ on the right of (2.84);
it would be sufficient to know that this limit is universal, in that
it does not depend on the underlying distribution. In fact, it would
even suffice to establish the slightly weaker statement

$E \frac{1}{n} \mathrm{tr}(\frac{1}{\sqrt{n}} M_n)^k = E \frac{1}{n} \mathrm{tr}(\frac{1}{\sqrt{n}} M'_n)^k + o_k(1)$

whenever $M_n$, $M'_n$ are two ensembles of Wigner matrices arising from
different underlying distributions (but still normalised to have mean
zero, unit variance, and to be bounded (or at worst subgaussian)).
We will take advantage of this perspective later in this section.

2.4.3. The Stieltjes transform method. The moment method
was computationally intensive, but straightforward. As noted in Remark
2.4.6, even without doing much of the algebraic computation, it
is clear that the moment method will show that some universal limit
for Wigner matrices exists (or, at least, that the differences between
the distributions of two different Wigner matrices converge to zero).
But it is not easy to see from this method why the limit should be
given by the semicircular law, as opposed to some other distribution
(although one could eventually work this out from an inverse moment
computation).

When studying the central limit theorem, we were able to use the
Fourier method to control the distribution of random matrices in a
cleaner way than in the moment method. Analogues of this method
exist, but require non-trivial formulae from noncommutative Fourier
analysis, such as the Harish-Chandra integration formula (and also
only work for highly symmetric ensembles, such as GUE or GOE),
and will not be discussed in this text^{28}.

We now turn to another method, the Stieltjes transform method,
which uses complex-analytic methods rather than Fourier-analytic
methods, and has turned out to be one of the most powerful and
accurate tools in dealing with the ESD of random Hermitian matrices.
Whereas the moment method started from the identity (2.83), the
Stieltjes transform method proceeds from the identity

$\int_{\mathbf{R}} \frac{1}{x - z} \, d\mu_{\frac{1}{\sqrt{n}} M_n}(x) = \frac{1}{n} \mathrm{tr}(\frac{1}{\sqrt{n}} M_n - zI)^{-1}$

for any complex $z$ not in the support of $\mu_{\frac{1}{\sqrt{n}} M_n}$. We refer to the
expression on the left-hand side as the Stieltjes transform of $M_n$ or of
$\mu_{\frac{1}{\sqrt{n}} M_n}$, and denote it by $s_{\mu_{\frac{1}{\sqrt{n}} M_n}}$ or as $s_n$ for short. The expression
$(\frac{1}{\sqrt{n}} M_n - zI)^{-1}$ is the normalised resolvent of $M_n$, and plays an
important role in the spectral theory of that matrix. Indeed, in contrast
to general-purpose methods such as the moment method, the Stieltjes
transform method draws heavily on the specific linear-algebraic
structure of this problem, and in particular on the rich structure of
resolvents.
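The identity between the Stieltjes transform and the normalised trace of the resolvent can be checked directly in the $2 \times 2$ case, where both eigenvalues and the matrix inverse have closed forms. In the sketch below, the symmetric matrix $A = [[a, b], [b, c]]$ plays the role of $\frac{1}{\sqrt{n}} M_n$ with $n = 2$; the function names are hypothetical and the numerical values are arbitrary.

```python
import math

def eig2(a, b, c):
    """Eigenvalues, in ascending order, of the real symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    return (mean - rad, mean + rad)

def stieltjes_from_eigenvalues(eigs, z):
    """(1/n) * sum_j 1 / (lambda_j - z): Stieltjes transform of the spectral measure."""
    return sum(1.0 / (lam - z) for lam in eigs) / len(eigs)

def stieltjes_from_resolvent(a, b, c, z):
    """(1/2) tr (A - zI)^{-1} for A = [[a, b], [b, c]], via the explicit 2 x 2 inverse."""
    p, s = a - z, c - z
    det = p * s - b * b                # determinant of A - zI
    return 0.5 * (p + s) / det         # the inverse has diagonal entries s/det and p/det

z = 0.3 + 0.5j                         # any z off the real axis will do
print(stieltjes_from_eigenvalues(eig2(1.0, 2.0, -1.0), z))
print(stieltjes_from_resolvent(1.0, 2.0, -1.0, z))
```

The two printed values agree up to rounding, reflecting the fact that the resolvent has eigenvalues $1/(\lambda_j - z)$.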

On the other hand, the Stieltjes transform can be viewed as a
generating function of the moments via the Taylor series expansion

$s_n(z) = -\frac{1}{z} - \frac{1}{z^2} \frac{1}{n^{3/2}} \mathrm{tr} M_n - \frac{1}{z^3} \frac{1}{n^2} \mathrm{tr} M_n^2 - \dots$,

valid for $z$ sufficiently large. This is somewhat (though not exactly)
analogous to how the characteristic function $E e^{itX}$ of a scalar random
variable can be viewed as a generating function of the moments $EX^k$.

^{28} Section 2.6, however, will contain some algebraic identities related
in some ways to the noncommutative Fourier-analytic approach.

Now let us study the Stieltjes transform method more systematically.
Given any probability measure $\mu$ on the real line, we can form
its Stieltjes transform

$s_\mu(z) := \int_{\mathbf{R}} \frac{1}{x - z} \, d\mu(x)$

for any $z$ outside of the support of $\mu$; in particular, the Stieltjes
transform is well-defined on the upper and lower half-planes in the
complex plane. Even without any further hypotheses on $\mu$ other than
it is a probability measure, we can say a remarkable amount about
how this transform behaves in $z$. Applying conjugations we obtain
the symmetry

(2.90) $s_\mu(\overline{z}) = \overline{s_\mu(z)}$

so we may as well restrict attention to $z$ in the upper half-plane (say).
Next, from the trivial bound

$|\frac{1}{x - z}| \le \frac{1}{|\mathrm{Im}(z)|}$

one has the pointwise bound

(2.91) $|s_\mu(z)| \le \frac{1}{|\mathrm{Im}(z)|}$.

In a similar spirit, an easy application of dominated convergence gives
the asymptotic

(2.92) $s_\mu(z) = \frac{-1 + o_\mu(1)}{z}$

where $o_\mu(1)$ is an expression that, for any fixed $\mu$, goes to zero as
$z$ goes to infinity non-tangentially in the sense that $|\mathrm{Re}(z)|/|\mathrm{Im}(z)|$
is kept bounded, where the rate of convergence is allowed to depend
on $\mu$. From differentiation under the integral sign (or an application
of Morera's theorem and Fubini's theorem) we see that $s_\mu(z)$ is complex
analytic on the upper and lower half-planes; in particular, it is
smooth away from the real axis. From the Cauchy integral formula
(or differentiation under the integral sign) we in fact get some bounds
for higher derivatives of the Stieltjes transform away from this axis:

(2.93) $|\frac{d^j}{dz^j} s_\mu(z)| = O_j(\frac{1}{|\mathrm{Im}(z)|^{j+1}})$.

Informally, $s_\mu$ "behaves like a constant" at scales significantly less
than the distance $|\mathrm{Im}(z)|$ to the real axis; all the really interesting
action here is going on near that axis.
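The three basic properties (2.90), (2.91) and (2.92) are easy to observe on a concrete example. The sketch below uses a discrete probability measure with three atoms; the atoms, weights, test point and the function name `stieltjes` are all arbitrary choices made here for illustration.

```python
def stieltjes(atoms, z):
    """Stieltjes transform of the probability measure with the given (point, weight) atoms."""
    return sum(w / (x - z) for x, w in atoms)

atoms = [(-1.5, 0.2), (0.0, 0.5), (2.0, 0.3)]      # weights sum to 1
z = 0.7 + 0.25j
s = stieltjes(atoms, z)

# (2.90): conjugation symmetry, so this difference is zero up to rounding.
print(abs(stieltjes(atoms, z.conjugate()) - s.conjugate()))

# (2.91): |s(z)| is at most 1 / |Im z|.
print(abs(s), 1.0 / abs(z.imag))

# (2.92): z * s(z) + 1 tends to zero as z -> infinity non-tangentially.
for t in (10.0, 100.0, 1000.0):
    w = t * (1 + 1j)
    print(t, abs(w * stieltjes(atoms, w) + 1.0))
```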

The imaginary part of the Stieltjes transform is particularly
interesting. Writing $z = a + ib$, we observe that

$\mathrm{Im} \frac{1}{x - z} = \frac{b}{(x - a)^2 + b^2} > 0$

and so we see that

$\mathrm{Im}(s_\mu(z)) > 0$

for $z$ in the upper half-plane; thus $s_\mu$ is a complex-analytic map from
the upper half-plane to itself, a type of function known as a Herglotz
function^{29}.

One can also express the imaginary part of the Stieltjes transform
as a convolution

(2.94) $\mathrm{Im}(s_\mu(a + ib)) = \pi \mu * P_b(a)$

where $P_b$ is the Poisson kernel

$P_b(x) := \frac{1}{\pi} \frac{b}{x^2 + b^2} = \frac{1}{b} P_1(\frac{x}{b})$.

As is well known, these kernels form a family of approximations to
the identity, and thus $\mu * P_b$ converges in the vague topology to $\mu$ (see
e.g. [Ta2010, §1.13]). Thus we see that

$\mathrm{Im} \, s_\mu(\cdot + ib) \rightharpoonup \pi\mu$

as $b \to 0^+$ in the vague topology, or equivalently (by (2.90)) that^{30}

(2.95) $\frac{s_\mu(\cdot + ib) - s_\mu(\cdot - ib)}{2\pi i} \rightharpoonup \mu$

as $b \to 0^+$. Thus we see that a probability measure $\mu$ can be recovered
in terms of the limiting behaviour of the Stieltjes transform on the
real axis.
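The recovery of a measure from the boundary values of its Stieltjes transform can be seen numerically. As a hedged illustration, the sketch below takes $\mu$ to be the semicircular distribution (2.81), computes $s_\mu(a + ib)$ by a midpoint-rule quadrature (an approximation chosen here, not a closed formula), and compares $\frac{1}{\pi} \mathrm{Im}\, s_\mu(a + ib)$ with the density at $a$ as $b$ decreases. The function names, the point $a = 0.5$ and the values of $b$ are arbitrary.

```python
import math

def semicircle_density(x):
    """Density (1 / (2 pi)) * sqrt(4 - x^2) on [-2, 2], zero elsewhere, as in (2.81)."""
    return math.sqrt(max(4.0 - x * x, 0.0)) / (2.0 * math.pi)

def stieltjes_semicircle(z, steps=20_000):
    """Midpoint-rule approximation of the integral of density(x) / (x - z), for Im(z) != 0."""
    h = 4.0 / steps
    total = 0j
    for i in range(steps):
        x = -2.0 + (i + 0.5) * h
        total += semicircle_density(x) / (x - z)
    return total * h

a = 0.5
for b in (0.5, 0.1, 0.02):
    smoothed = stieltjes_semicircle(a + 1j * b).imag / math.pi   # equals (mu * P_b)(a)
    print(b, smoothed, semicircle_density(a))
```

The smoothed values approach the density from below at this point, at a rate roughly proportional to $b$, consistent with $P_b$ being an approximation to the identity.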



^{29} In fact, all complex-analytic maps from the upper half-plane to itself
that obey the asymptotic (2.92) are of this form; this is a special case of
the Herglotz representation theorem, which also gives a slightly more
general description in the case when the asymptotic (2.92) is not assumed.
A good reference for this material and its consequences is [Ga2007].

^{30} The limiting formula (2.95) is closely related to the Plemelj formula
in potential theory.

A variant of the above machinery gives us a criterion for convergence:

Exercise 2.4.10 (Stieltjes continuity theorem). Let $\mu_n$ be a sequence
of random probability measures on the real line, and let $\mu$ be a
deterministic probability measure.

(i) $\mu_n$ converges almost surely to $\mu$ in the vague topology if and
only if $s_{\mu_n}(z)$ converges almost surely to $s_\mu(z)$ for every $z$
in the upper half-plane.

(ii) $\mu_n$ converges in probability to $\mu$ in the vague topology if and
only if $s_{\mu_n}(z)$ converges in probability to $s_\mu(z)$ for every $z$
in the upper half-plane.

(iii) $\mu_n$ converges in expectation to $\mu$ in the vague topology if
and only if $E s_{\mu_n}(z)$ converges to $s_\mu(z)$ for every $z$ in the
upper half-plane.

(Hint: The "only if" parts are fairly easy. For the "if" parts, take a
test function $\phi \in C_c(\mathbf{R})$ and approximate $\int_{\mathbf{R}} \phi \, d\mu$ by
$\int_{\mathbf{R}} \phi * P_b \, d\mu = \frac{1}{\pi} \int_{\mathbf{R}} \mathrm{Im} \, s_\mu(a + ib) \phi(a) \, da$. Then approximate this
latter integral in turn by a Riemann sum, using (2.93).)

Thus, to prove the semicircular law, it suffices to show that for
each $z$ in the upper half-plane, the Stieltjes transform

$s_n(z) = s_{\mu_{\frac{1}{\sqrt{n}} M_n}}(z) = \frac{1}{n} \mathrm{tr}(\frac{1}{\sqrt{n}} M_n - zI)^{-1}$

converges almost surely (and thus in probability and in expectation)
to the Stieltjes transform $s_{\mu_{sc}}(z)$ of the semicircular law.

It is not difficult to compute the Stieltjes transform $s_{\mu_{sc}}$ of the
semicircular law, but let us hold off on that task for now, because
we want to illustrate how the Stieltjes transform method can be used
to find the semicircular law, even if one did not know this law in
advance, by directly controlling $s_n(z)$. We will fix $z = a + ib$ to be a
complex number not on the real line, and allow all implied constants
in the discussion below to depend on $a$ and $b$ (we will focus here only
on the behaviour as $n \to \infty$).

The main idea here is predecessor comparison: to compare the
transform $s_n(z)$ of the $n \times n$ matrix $M_n$ with the transform $s_{n-1}(z)$ of
the top left $n-1 \times n-1$ minor $M_{n-1}$, or of other minors. For instance,
we have the Cauchy interlacing law (Exercise 1.3.14), which asserts that
the eigenvalues $\lambda_1(M_{n-1}), \dots, \lambda_{n-1}(M_{n-1})$ of $M_{n-1}$ intersperse that
of $\lambda_1(M_n), \dots, \lambda_n(M_n)$. This implies that for a complex number $a + ib$
with $b > 0$, the difference

$\sum_{j=1}^{n-1} \frac{b}{(\lambda_j(M_{n-1})/\sqrt{n} - a)^2 + b^2} - \sum_{j=1}^{n} \frac{b}{(\lambda_j(M_n)/\sqrt{n} - a)^2 + b^2}$

is an alternating sum of evaluations of the function $x \mapsto \frac{b}{(x - a)^2 + b^2}$.
The total variation of this function is $O(1)$ (recall that we are
suppressing dependence of constants on $a, b$), and so the alternating
sum above is $O(1)$. Writing this in terms of the Stieltjes transform,
we conclude that

$\sqrt{n(n-1)} \, s_{n-1}(\frac{\sqrt{n}}{\sqrt{n-1}}(a + ib)) - n s_n(a + ib) = O(1)$.

Applying (2.93) to approximate $s_{n-1}(\frac{\sqrt{n}}{\sqrt{n-1}}(a + ib))$ by $s_{n-1}(a + ib)$,
we conclude that

(2.96) $s_n(a + ib) = s_{n-1}(a + ib) + O(\frac{1}{n})$.

So for fixed $z = a + ib$ away from the real axis, the Stieltjes transform
$s_n(z)$ is quite stable in $n$.

This stability has the following important consequence. Observe
that while the left-hand side of (2.96) depends on the $n \times n$ matrix
$M_n$, the right-hand side depends only on the top left minor $M_{n-1}$
of that matrix. In particular, it is independent of the $n$th row and
column of $M_n$. This implies that this entire row and column has only
a limited amount of influence on the Stieltjes transform $s_n(a + ib)$:
no matter what value one assigns to this row and column (including
possibly unbounded values, as long as one keeps the matrix Hermitian
of course), the transform $s_n(a + ib)$ can only move by $O(\frac{|a| + |b|}{b^2 n})$.

By permuting the rows and columns, we obtain that in fact any
row or column of $M_n$ can influence $s_n(a + ib)$ by at most $O(\frac{1}{n})$. (This
is closely related to the observation in Exercise 2.4.4 that low rank
perturbations do not significantly affect the ESD.) On the other hand,
the rows of (the upper triangular portion of) $M_n$ are jointly independent.
When $M_n$ is a Wigner random matrix, we can then apply a
standard concentration of measure result, such as McDiarmid's inequality
(Theorem 2.1.10) to conclude concentration of $s_n$ around its
mean:

(2.97) $P(|s_n(a + ib) - E s_n(a + ib)| \ge \lambda/\sqrt{n}) \le C e^{-c\lambda^2}$

for all $\lambda > 0$ and some absolute constants $C, c > 0$. (This is not
necessarily the strongest concentration result one can establish for the
Stieltjes transform, but it will certainly suffice for our discussion here.)
In particular, we see from the Borel-Cantelli lemma (Exercise 1.1.1)
that for any fixed $z$ away from the real line, $s_n(z) - E s_n(z)$ converges
almost surely (and thus also in probability) to zero. As a consequence,
convergence of $s_n(z)$ in expectation automatically implies convergence
in probability or almost sure convergence.

However, while concentration of measure tells us that $s_n(z)$ is
close to its mean, it does not shed much light as to what this mean
is. For this, we have to go beyond the Cauchy interlacing formula
and deal with the resolvent $(\frac{1}{\sqrt{n}}M_n - zI_n)^{-1}$ more directly. Firstly,
we observe from the linearity of trace that

$$\mathbf{E}s_n(z) = \frac{1}{n}\sum_{j=1}^{n} \mathbf{E}\left[\left(\frac{1}{\sqrt{n}}M_n - zI_n\right)^{-1}\right]_{jj}$$

where $[A]_{jj}$ denotes the $jj$ component of a matrix $A$. Because $M_n$ is
a Wigner matrix, it is easy to see on permuting the rows and columns
that all of the random variables $[(\frac{1}{\sqrt{n}}M_n - zI_n)^{-1}]_{jj}$ have the same
distribution. Thus we may simplify the above formula as

$$\mathbf{E}s_n(z) = \mathbf{E}\left[\left(\frac{1}{\sqrt{n}}M_n - zI_n\right)^{-1}\right]_{nn}. \tag{2.98}$$

So now we have to compute the last entry of an inverse of a matrix.
There are of course a number of formulae for this, such as Cramer's
rule. But it will be more convenient here to use a formula based
instead on the Schur complement:

Exercise 2.4.11. Let $A_n$ be an $n \times n$ matrix, let $A_{n-1}$ be the top
left $(n-1) \times (n-1)$ minor, let $a_{nn}$ be the bottom right entry of $A_n$,
let $X \in \mathbf{C}^{n-1}$ be the right column of $A_n$ with the bottom right entry
removed, and let $(X')^* \in (\mathbf{C}^{n-1})^*$ be the bottom row with the bottom
right entry removed. In other words,

$$A_n = \begin{pmatrix} A_{n-1} & X \\ (X')^* & a_{nn} \end{pmatrix}.$$

Assume that $A_n$ and $A_{n-1}$ are both invertible. Show that

$$[A_n^{-1}]_{nn} = \frac{1}{a_{nn} - (X')^* A_{n-1}^{-1} X}.$$

(Hint: Solve the equation $A_n v = e_n$, where $e_n$ is the $n^{\text{th}}$ basis vector,
using the method of Schur complements (or from first principles).)
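Before attempting the exercise, one can sanity-check the identity numerically. The sketch below assumes a generic complex Gaussian matrix (which is invertible, together with its minor, with probability one); the variable names and the size $n = 6$ are ours.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed, purely for reproducibility
n = 6
a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

a_minor = a[:-1, :-1]        # A_{n-1}: top left (n-1) x (n-1) minor
x = a[:-1, -1]               # X: right column without the bottom right entry
x_prime_star = a[-1, :-1]    # (X')^*: bottom row without the bottom right entry
a_nn = a[-1, -1]             # bottom right entry

# Left-hand side: the (n, n) entry of the full inverse.
lhs = np.linalg.inv(a)[-1, -1]
# Right-hand side: reciprocal of the Schur complement of A_{n-1} in A_n.
rhs = 1.0 / (a_nn - x_prime_star @ np.linalg.solve(a_minor, x))
print(abs(lhs - rhs))
```

The two quantities should agree up to rounding error.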

The point of this identity is that it describes (part of) the inverse
of $A_n$ in terms of the inverse of $A_{n-1}$, which will eventually provide
a non-trivial recursive relationship between $s_n(z)$ and $s_{n-1}(z)$,
which can then be played off against (2.96) to solve for $s_n(z)$ in the
asymptotic limit $n \to \infty$.

In our situation, the matrix $\frac{1}{\sqrt{n}}M_n - zI_n$ and its minor
$\frac{1}{\sqrt{n}}M_{n-1} - zI_{n-1}$ are automatically invertible (as $z$ is not real).
Inserting the above formula into (2.98) (and recalling that we normalised
the diagonal of $M_n$ to vanish), we conclude that

$$\mathbf{E}s_n(z) = -\mathbf{E}\frac{1}{z + \frac{1}{n}X^*\left(\frac{1}{\sqrt{n}}M_{n-1} - zI_{n-1}\right)^{-1}X}, \tag{2.99}$$

where $X \in \mathbf{C}^{n-1}$ is the right column of $M_n$ with the bottom
entry $\xi_{nn}$ removed.

One may be concerned that the denominator here could vanish.
However, observe that $z$ has imaginary part $b$ if $z = a + ib$. Furthermore,
from the spectral theorem we see that the imaginary part
of $(\frac{1}{\sqrt{n}}M_{n-1} - zI_{n-1})^{-1}$ is positive definite (for $b > 0$), and so
$X^*(\frac{1}{\sqrt{n}}M_{n-1} - zI_{n-1})^{-1}X$ has non-negative imaginary part. As a
consequence the magnitude of the denominator here is bounded below by $|b|$, and so
its reciprocal is $O(1)$ (compare with (2.91)). So the reciprocal here is
not going to cause any discontinuity, as we are taking $b$ to be fixed
and non-zero.

Now we need to understand the expression $X^*(\frac{1}{\sqrt{n}}M_{n-1} - zI_{n-1})^{-1}X$.
We write this as $X^*RX$, where $R$ is the resolvent matrix
$R := (\frac{1}{\sqrt{n}}M_{n-1} - zI_{n-1})^{-1}$. The distribution of the random matrix $R$ could
conceivably be quite complicated. However, the key point is that the
vector $X$ only involves the entries of $M_n$ that do not lie in $M_{n-1}$, and
so the random matrix $R$ and the vector $X$ are independent. Because
of this, we can use the randomness of $X$ to do most of the work in
understanding the expression $X^*RX$, without having to know much
about $R$ at all.

To understand this, let us first condition $R$ to be a deterministic
matrix $R = (r_{ij})_{1 \leq i,j \leq n-1}$, and see what we can do with the
expression $X^*RX$.

Firstly, observe that $R$ will not be arbitrary; indeed, from the
spectral theorem we see that $R$ will have operator norm at most
$O(1)$. Meanwhile, from the Chernoff inequality (Theorem 2.1.3) or
Hoeffding inequality (Exercise 2.1.4) we know that $X$ has magnitude
$O(\sqrt{n})$ with overwhelming probability. So we know that $X^*RX$ has
magnitude $O(n)$ with overwhelming probability.

Furthermore, we can use concentration of measure as follows.
Given any positive semi-definite matrix $A$ of operator norm $O(1)$,
the expression $(X^*AX)^{1/2} = \|A^{1/2}X\|$ is a Lipschitz function of $X$
with Lipschitz constant $O(1)$. Applying Talagrand's inequality (Theorem
2.1.13) we see that this expression concentrates around its median:

$$\mathbf{P}\left(|(X^*AX)^{1/2} - \mathbf{M}(X^*AX)^{1/2}| \geq \lambda\right) \leq Ce^{-c\lambda^2}$$

for any $\lambda > 0$. On the other hand, $\|A^{1/2}X\| = O(\|X\|)$ has magnitude
$O(\sqrt{n})$ with overwhelming probability, so the median $\mathbf{M}(X^*AX)^{1/2}$
must be $O(\sqrt{n})$. Squaring, we conclude that

$$\mathbf{P}\left(|X^*AX - \mathbf{M}X^*AX| \geq \lambda\sqrt{n}\right) \leq Ce^{-c\lambda^2}$$

(possibly after adjusting the absolute constants $C, c$). As usual, we
may replace the median with the expectation:

$$\mathbf{P}\left(|X^*AX - \mathbf{E}X^*AX| \geq \lambda\sqrt{n}\right) \leq Ce^{-c\lambda^2}.$$

This was for positive-definite matrices, but one can easily use the
triangle inequality to generalise to self-adjoint matrices, and then to
arbitrary matrices, of operator norm $O(1)$, and conclude that

$$\mathbf{P}\left(|X^*RX - \mathbf{E}X^*RX| \geq \lambda\sqrt{n}\right) \leq Ce^{-c\lambda^2} \tag{2.100}$$

for any deterministic matrix $R$ of operator norm $O(1)$.

But what is the expectation $\mathbf{E}X^*RX$? This can be expressed in
components as

$$\mathbf{E}X^*RX = \sum_{i=1}^{n-1}\sum_{j=1}^{n-1} \mathbf{E}\,\overline{\xi_{in}}\, r_{ij}\, \xi_{jn}$$

where $\xi_{in}$ are the entries of $X$, and $r_{ij}$ are the entries of $R$. But the
$\xi_{in}$ are iid with mean zero and variance one, so $\mathbf{E}\,\overline{\xi_{in}}\xi_{jn}$ equals
$1$ when $i = j$ and $0$ otherwise, and the standard second
moment computation shows that this expectation is nothing more
than the trace

$$\operatorname{tr}(R) = \sum_{i=1}^{n-1} r_{ii}$$

of $R$. We have thus shown the concentration of measure result

$$\mathbf{P}\left(|X^*RX - \operatorname{tr}(R)| \geq \lambda\sqrt{n}\right) \leq Ce^{-c\lambda^2} \tag{2.101}$$

for any deterministic matrix $R$ of operator norm $O(1)$, and any $\lambda > 0$.
Informally, $X^*RX$ is typically $\operatorname{tr}(R) + O(\sqrt{n})$.
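The informal statement can be illustrated with a small experiment. This is only a sketch under assumptions of our own choosing: the minor has independent random sign entries above the diagonal and zero diagonal, $X$ is an independent random sign vector, and $n = 400$, $z = 0.3 + i$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed, purely for reproducibility
n = 400
z = 0.3 + 1.0j

# Minor M_{n-1}: real symmetric, off-diagonal entries +-1, zero diagonal.
signs = rng.choice([-1.0, 1.0], size=(n - 1, n - 1))
m_minor = np.triu(signs, 1)
m_minor = m_minor + m_minor.T

# Resolvent R = (M_{n-1}/sqrt(n) - z I)^{-1}; its operator norm is at most 1/Im(z).
resolvent = np.linalg.inv(m_minor / np.sqrt(n) - z * np.eye(n - 1))

# X is drawn independently of the minor, as in the text.
x = rng.choice([-1.0, 1.0], size=n - 1)

quadratic_form = x @ resolvent @ x   # X^* R X (X is real, so X^* is the transpose)
trace = np.trace(resolvent)
deviation = abs(quadratic_form - trace)
print(f"|X*RX - tr R| / sqrt(n) = {deviation / np.sqrt(n):.3f}")
print(f"|tr R| / n             = {abs(trace) / n:.3f}")
```

The first printed ratio should be of size $O(1)$, while $\operatorname{tr}(R)$ itself is of size comparable to $n$, so the fluctuation is indeed of lower order.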

The bound (2.101) was proven for deterministic matrices, but by
using conditional expectation it also applies for any random matrix
$R$, so long as that matrix is independent of $X$. In particular, we may
apply it to our specific matrix of interest

$$R := \left(\frac{1}{\sqrt{n}}M_{n-1} - zI_{n-1}\right)^{-1}.$$

The trace of this matrix is essentially just the Stieltjes transform
$s_{n-1}(z)$ at $z$. Actually, due to the normalisation factor being slightly
off, we have

$$\operatorname{tr}(R) = \sqrt{n(n-1)}\, s_{n-1}\left(\frac{\sqrt{n}}{\sqrt{n-1}}z\right),$$

but by using the smoothness (2.93) of the Stieltjes transform, together
with the stability property (2.96), we can simplify this as

$$\operatorname{tr}(R) = n(s_n(z) + o(1)).$$

In particular, from (2.101) and (2.97), we see that

$$X^*RX = n(\mathbf{E}s_n(z) + o(1))$$

with overwhelming probability. Putting this back into (2.99), and
recalling that the denominator is bounded away from zero, we have
the remarkable equation

$$\mathbf{E}s_n(z) = -\frac{1}{z + \mathbf{E}s_n(z)} + o(1). \tag{2.102}$$

Note how this equation came by playing off two ways in which the
spectral properties of a matrix $M_n$ interacted with that of its minor
$M_{n-1}$; firstly via the Cauchy interlacing inequality, and secondly via
the Schur complement formula.

This equation already describes the behaviour of $\mathbf{E}s_n(z)$ quite
well, but we will content ourselves with understanding the limiting
behaviour as $n \to \infty$. From (2.93) and Fubini's theorem we know
that the function $\mathbf{E}s_n$ is locally uniformly equicontinuous and locally
uniformly bounded away from the real line. Applying the Arzelà-Ascoli
theorem, we thus conclude that on a subsequence at least, $\mathbf{E}s_n$
converges locally uniformly to a limit $s$. This will be a Herglotz
function (i.e. an analytic function mapping the upper half-plane to
the upper half-plane), and taking limits in (2.102) (observing that the
imaginary part of the denominator here is bounded away from zero)
we end up with the exact equation

$$s(z) = -\frac{1}{z + s(z)}. \tag{2.103}$$

We can of course solve this by the quadratic formula, obtaining

$$s(z) = -\frac{z \pm \sqrt{z^2-4}}{2} = -\frac{2}{z \mp \sqrt{z^2-4}}.$$

To figure out what branch of the square root one has to use here, we
use (2.92), which easily implies$^{31}$ that

$$s(z) = \frac{-1+o(1)}{z}$$

as $z$ goes to infinity non-tangentially away from the real line. Also, we
know that $s$ has to be complex analytic (and in particular, continuous)
away from the real line. From this and basic complex analysis, we
conclude that

$$s(z) = \frac{-z + \sqrt{z^2-4}}{2} \tag{2.104}$$

where $\sqrt{z^2-4}$ is the branch of the square root with a branch cut at
$[-2, 2]$ and which is asymptotic to $z$ at infinity.

$^{31}$ To justify this, one has to make the error term in (2.92) uniform in $n$, but this
can be accomplished without difficulty using the Bai-Yin theorem (for instance).

As there is only one possible subsequence limit of the $\mathbf{E}s_n$, we
conclude that $\mathbf{E}s_n$ converges locally uniformly (and thus pointwise)
to the function (2.104), and thus (by the concentration of measure of
$s_n(z)$) we see that for each $z$, $s_n(z)$ converges almost surely (and in
probability) to $s(z)$.
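As an informal check of this convergence, one can compare the empirical Stieltjes transform of a single sampled matrix with the formula (2.104). The sketch below assumes a real symmetric Gaussian Wigner matrix with zero diagonal; the function names are ours, and the branch of the square root described above is realised (as one convenient choice) by the product of the principal branches of $\sqrt{z-2}$ and $\sqrt{z+2}$.

```python
import numpy as np

def semicircle_stieltjes(z):
    """Formula (2.104): (-z + sqrt(z^2 - 4)) / 2, with the square root cut along
    [-2, 2] and asymptotic to z at infinity."""
    z = np.asarray(z, dtype=complex)
    root = np.sqrt(z - 2.0) * np.sqrt(z + 2.0)   # product of principal branches
    return (-z + root) / 2.0

def empirical_stieltjes(matrix, z):
    """s_n(z) for the normalised matrix M_n / sqrt(n)."""
    n = matrix.shape[0]
    eigenvalues = np.linalg.eigvalsh(matrix) / np.sqrt(n)
    return np.mean(1.0 / (eigenvalues - z))

rng = np.random.default_rng(3)   # fixed seed, purely for reproducibility
n = 400
g = rng.standard_normal((n, n))
wigner = np.triu(g, 1)
wigner = wigner + wigner.T       # symmetric, unit variance off the diagonal

z = 0.4 + 1.0j
s_limit = semicircle_stieltjes(z)
s_n = empirical_stieltjes(wigner, z)
print("|s_n(z) - s(z)|         =", abs(s_n - s_limit))
print("residual of (2.103)     =", abs(s_limit + 1.0 / (z + s_limit)))
```

The first number should be small (and shrink as $n$ grows), while the second vanishes up to rounding, since (2.104) solves (2.103) exactly.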

Exercise 2.4.12. Find a direct proof (starting from (2.102), (2.92),
and the smoothness of $\mathbf{E}s_n(z)$) that $\mathbf{E}s_n(z) = s(z) + o(1)$ for any
fixed $z$, that avoids using the Arzelà-Ascoli theorem. (The basic point
here is that one has to solve the approximate equation (2.102), using
some robust version of the quadratic formula. The fact that $\mathbf{E}s_n$ is a
Herglotz function will help eliminate various unwanted possibilities,
such as one coming from the wrong branch of the square root.)

To finish computing the limiting ESD of Wigner matrices, we
have to figure out what probability measure $s$ comes from. But this
is easily read off from (2.104) and (2.95):

$$\frac{s(\cdot + ib) - s(\cdot - ib)}{2\pi i} \rightharpoonup \frac{1}{2\pi}(4 - x^2)_+^{1/2}\, dx = \mu_{sc} \tag{2.105}$$

as $b \to 0$. Thus the semicircular law is the only possible measure
which has Stieltjes transform $s$, and indeed a simple application of
the Cauchy integral formula and (2.105) shows us that $s$ is indeed the
Stieltjes transform of $\mu_{sc}$.

Putting all this together, we have completed the Stieltjes transform
proof of the semicircular law.
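The inversion step (2.105) can also be checked numerically: for small $b > 0$, the quantity $\frac{1}{\pi}\operatorname{Im} s(x+ib)$ should be close to the semicircular density. The sketch below reuses the explicit formula (2.104); the function names, the grid of points and the value $b = 10^{-6}$ are illustrative assumptions.

```python
import numpy as np

def semicircle_stieltjes(z):
    """Formula (2.104), with the square root cut along [-2, 2]."""
    z = np.asarray(z, dtype=complex)
    return (-z + np.sqrt(z - 2.0) * np.sqrt(z + 2.0)) / 2.0

def semicircle_density(x):
    """Density (1 / (2 pi)) * (4 - x^2)_+^{1/2} of the semicircular law."""
    return np.sqrt(np.clip(4.0 - x**2, 0.0, None)) / (2.0 * np.pi)

x = np.linspace(-3.0, 3.0, 13)
b = 1e-6
# (s(x + ib) - s(x - ib)) / (2 pi i) equals Im s(x + ib) / pi,
# because a Stieltjes transform satisfies s(conj(z)) = conj(s(z)).
recovered = semicircle_stieltjes(x + 1j * b).imag / np.pi
max_error = np.max(np.abs(recovered - semicircle_density(x)))
print(max_error)
```

The largest discrepancy occurs at the edges $x = \pm 2$, where the density has a square root singularity and the error is of order $\sqrt{b}$ rather than $b$.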

Remark 2.4.7. In order to simplify the above exposition, we opted
for a qualitative analysis of the semicircular law here, ignoring such
questions as the rate of convergence to this law. However, an inspection
of the above arguments reveals that it is easy to make all of the
above analysis quite quantitative, with quite reasonable control on all
terms$^{32}$. In particular, it is not hard to use the above analysis to show
that for $|\operatorname{Im}(z)| \geq n^{-c}$ for some small absolute constant $c > 0$, one
has $s_n(z) = s(z) + O(n^{-c})$ with overwhelming probability. Combining
this with a suitably quantitative version of the Stieltjes continuity
theorem, this in turn gives a polynomial rate of convergence of the
ESDs $\mu_{\frac{1}{\sqrt{n}}M_n}$ to the semicircular law $\mu_{sc}$, in that one has

$$\mu_{\frac{1}{\sqrt{n}}M_n}(-\infty, \lambda) = \mu_{sc}(-\infty, \lambda) + O(n^{-c})$$

with overwhelming probability for all $\lambda \in \mathbf{R}$.

A variant of this quantitative analysis can in fact get very good
control on this ESD down to quite fine scales, namely to scales $\frac{\log^{O(1)} n}{n}$,
which is only just a little bit larger than the mean spacing $O(1/n)$ of
the normalised eigenvalues (recall that we have $n$ normalised eigenvalues,
constrained to lie in the interval $[-2 - o(1), 2 + o(1)]$ by the
Bai-Yin theorem). This was accomplished by Erdős, Schlein, and
Yau [ErScYa2008]$^{33}$ by using an additional observation, namely that
the eigenvectors of a random matrix are very likely to be delocalised
in the sense that their $\ell^2$ energy is dispersed more or less evenly across
their coefficients. Such delocalization has since proven to be a fundamentally
important ingredient in the fine-scale spectral analysis of
Wigner matrices, which is beyond the scope of this text.

$^{32}$ One has to use Exercise 2.4.12 instead of the Arzelà-Ascoli theorem if one wants
everything to be quantitative.

$^{33}$ Strictly speaking, this paper assumed additional regularity hypotheses on the
distribution $\xi$, but these conditions can be removed with the assistance of Talagrand's
inequality, Theorem 2.1.13.

2.4.4. Dyson Brownian motion and the Stieltjes transform.

We now explore how the Stieltjes transform interacts with the Dyson
Brownian motion (introduced in Section 3.1). We let $n$ be a large
number, and let $M_n(t)$ be a Wiener process of Hermitian random
matrices, with associated eigenvalues $\lambda_1(t), \dots, \lambda_n(t)$, Stieltjes transforms

$$s(t,z) := \frac{1}{n}\sum_{j=1}^{n} \frac{1}{\lambda_j(t)/\sqrt{n} - z} \tag{2.106}$$

and spectral measures

$$\mu(t) := \frac{1}{n}\sum_{j=1}^{n} \delta_{\lambda_j(t)/\sqrt{n}}. \tag{2.107}$$

We now study how $s, \mu$ evolve in time in the asymptotic limit $n \to \infty$.
Our computation will be only heuristic in nature.

Recall from Section 3.1 that the eigenvalues $\lambda_i = \lambda_i(t)$ undergo
Dyson Brownian motion

$$d\lambda_i = dB_i + \sum_{j \neq i} \frac{dt}{\lambda_i - \lambda_j}. \tag{2.108}$$

Applying (2.106) and Taylor expansion (dropping all terms of higher
order than $dt$, using the Ito heuristic $dB_i = O(dt^{1/2})$), we conclude
that

$$ds = -\frac{1}{n^{3/2}}\sum_{i=1}^{n} \frac{dB_i}{(\lambda_i/\sqrt{n} - z)^2} + \frac{1}{n^2}\sum_{i=1}^{n} \frac{|dB_i|^2}{(\lambda_i/\sqrt{n} - z)^3} - \frac{1}{n^{3/2}}\sum_{1 \leq i,j \leq n: i \neq j} \frac{dt}{(\lambda_i - \lambda_j)(\lambda_i/\sqrt{n} - z)^2}.$$

For $z$ away from the real line, the term $\frac{1}{n^2}\sum_{i=1}^{n} \frac{|dB_i|^2}{(\lambda_i/\sqrt{n} - z)^3}$ is of size
$O(dt/n)$ and can heuristically be ignored in the limit $n \to \infty$. Dropping
this term, and then taking expectations to remove the Brownian
motion term $dB_i$, we are led to

$$\mathbf{E}\,ds = -\mathbf{E}\frac{1}{n^{3/2}}\sum_{1 \leq i,j \leq n: i \neq j} \frac{dt}{(\lambda_i - \lambda_j)(\lambda_i/\sqrt{n} - z)^2}.$$

Performing the $j$ summation using (2.106), which heuristically gives
$\sum_{j \neq i} \frac{1}{\lambda_i - \lambda_j} = -\sqrt{n}\, s(\lambda_i/\sqrt{n})$, we obtain

$$\mathbf{E}\,ds = \mathbf{E}\frac{1}{n}\sum_{1 \leq i \leq n} \frac{s(\lambda_i/\sqrt{n})\,dt}{(\lambda_i/\sqrt{n} - z)^2}$$

where we adopt the convention that for real $x$, $s(x)$ is the average of
$s(x + i0)$ and $s(x - i0)$. Using (2.107), this becomes

$$\mathbf{E}s_t = \mathbf{E}\int_{\mathbf{R}} \frac{s(x)}{(x - z)^2}\, d\mu(x) \tag{2.109}$$

where the $t$ subscript denotes differentiation in $t$. From (2.95) we
heuristically have

$$s(x \pm i0) = s(x) \pm \pi i \mu(x)$$

(heuristically treating $\mu$ as a function rather than a measure) and on
squaring one obtains

$$s(x \pm i0)^2 = (s(x)^2 - \pi^2\mu(x)^2) \pm 2\pi i\, s(x)\mu(x).$$

From this and the Cauchy integral formula around a slit in the real axis (using
the bound (2.91) to ignore the contributions near infinity) we thus
have

$$s^2(z) = 2\int_{\mathbf{R}} \frac{s(x)}{x - z}\, d\mu(x)$$

and thus on differentiation in $z$

$$2 s s_z(z) = 2\int_{\mathbf{R}} \frac{s(x)}{(x - z)^2}\, d\mu(x).$$

Comparing this with (2.109), we obtain

$$\mathbf{E}s_t - \mathbf{E}s s_z = 0.$$

From concentration of measure, we expect $s$ to concentrate around
its mean $\overline{s} := \mathbf{E}s$, and similarly $s_z$ should concentrate around $\overline{s}_z$. In
the limit $n \to \infty$, the expected Stieltjes transform $\overline{s}$ (which we now
write simply as $s$) should thus obey the (complex) Burgers' equation

$$s_t - s s_z = 0. \tag{2.110}$$

(With the sign convention of (2.106) the nonlinear term appears with a
minus sign; replacing $s$ by $u := -s$ gives the more familiar form
$u_t + u u_z = 0$. As a consistency check, the transform
$s(t,z) = -\frac{1}{z} - \frac{t}{z^3} - \dots$ of the semicircular law of variance $t$
has $s_t$ and $s s_z$ both equal to $-\frac{1}{z^3}$ to leading order.)

To illustrate how this equation works in practice, let us give an informal
derivation of the semicircular law. We consider the case when
the Wiener process starts from $M(0) = 0$, thus $M_t = \sqrt{t}G$ for a GUE
matrix $G$. As such, we have the scaling symmetry

$$s(t,z) = \frac{1}{\sqrt{t}}\, s_{GUE}\left(\frac{z}{\sqrt{t}}\right)$$

where $s_{GUE}$ is the asymptotic Stieltjes transform for GUE (which we
secretly know to be given by (2.104), but let us pretend that we did
not yet know this fact). Inserting this self-similar ansatz into (2.110)
and setting $t = 1$, we conclude that

$$-\frac{1}{2}s_{GUE} - \frac{1}{2}z s'_{GUE} - s_{GUE}s'_{GUE} = 0;$$

multiplying by $-2$ and integrating, we conclude that

$$z s_{GUE} + s_{GUE}^2 = C$$

for some constant $C$. But from the asymptotic (2.92) we see that
$C$ must equal $-1$. But then the above equation can be rearranged
into (2.103), and so by repeating the arguments at the end of the
previous section we can deduce the formula (2.104), which then gives
the semicircular law by (2.95).

As is well known in PDE, one can solve Burgers' equation more
generally by the method of characteristics. For reasons that will become
clearer in Section 2.5, we now solve this equation by a slightly
different (but ultimately equivalent) method. The idea is that rather
than think of $s = s(t,z)$ as a function of $z$ for fixed $t$, we think$^{34}$ of
$z = z(t,s)$ as a function of $s$ for fixed $t$. Note from (2.92) that we
expect to be able to invert the relationship between $s$ and $z$ as long
as $z$ is large (and $s$ is small).

To exploit this change of perspective, we think of $s, z, t$ as all
varying by infinitesimal amounts $ds, dz, dt$ respectively. Using (2.110)
and the total derivative formula $ds = s_t\,dt + s_z\,dz$, we see that

$$ds = s s_z\,dt + s_z\,dz.$$

If we hold $s$ fixed (i.e. $ds = 0$), so that $z$ is now just a function of $t$,
and cancel off the $s_z$ factor, we conclude that

$$\frac{dz}{dt} = -s.$$

Integrating this, we see that

$$z(t,s) = z(0,s) - ts. \tag{2.111}$$

This, in principle, gives a way to compute $s(t,z)$ from $s(0,z)$. First,
we invert the relationship $s = s(0,z)$ to $z = z(0,s)$; then we subtract $ts$
from $z(0,s)$; then we invert again to recover $s(t,z)$. (For instance,
starting from $M_0 = 0$ one has $s(0,z) = -1/z$ and thus $z(0,s) = -1/s$,
so that $z(1,s) = -1/s - s$, which is a rearrangement of (2.103).)

Since $M_t = M_0 + \sqrt{t}G$, where $G$ is a GUE matrix independent
of $M_0$, we have thus given a formula to describe the Stieltjes transform
of $M_0 + \sqrt{t}G$ in terms of the Stieltjes transform of $M_0$. This
formula is a special case of a more general formula of Voiculescu for
free convolution, with the operation of inverting the Stieltjes transform
essentially being the famous R-transform of Voiculescu; we will
discuss this more in the next section.

$^{34}$ This trick is sometimes known as the hodograph transform, especially if one
views $s$ as "velocity" and $z$ as "position".

2.5. Free probability 

In the foundations of modern probability, as laid out by Kolmogorov 
(and briefly reviewed in Section 1.1), the basic objects of study are 
constructed in the following order: 

(i) Firstly, one selects a sample space $\Omega$, whose elements $\omega$ represent
all the possible states that one's stochastic system
could be in.

(ii) Then, one selects a $\sigma$-algebra $\mathcal{B}$ of events $E$ (modeled by
subsets of $\Omega$), and assigns each of these events a probability
$\mathbf{P}(E) \in [0, 1]$ in a countably additive manner, so that the
entire sample space has probability 1.

(iii) Finally, one builds (commutative) algebras of random variables
$X$ (such as complex-valued random variables, modeled
by measurable functions from $\Omega$ to $\mathbf{C}$), and (assuming
suitable integrability or moment conditions) one can assign
expectations $\mathbf{E}X$ to each such random variable.

In measure theory, the underlying measure space $\Omega$ plays a prominent
foundational role, with the measurable sets and measurable functions
(the analogues of the events and the random variables) always
being viewed as somehow being attached to that space. In probability
theory, in contrast, it is the events and their probabilities that
are viewed as being fundamental, with the sample space $\Omega$ being
abstracted away as much as possible, and with the random variables
and expectations being viewed as derived concepts. See Section 1.1
for further discussion of this philosophy.

However, it is possible to take the abstraction process one step
further, and view the algebra of random variables and their expectations
as being the foundational concept, ignoring the
presence of the original sample space, the algebra of events, and the
probability measure.

There are two reasons for wanting to shed (or abstract$^{35}$ away)
these previously foundational structures. Firstly, it allows one to more
easily take certain types of limits, such as the large $n$ limit $n \to \infty$
when considering $n \times n$ random matrices, because quantities built
from the algebra of random variables and their expectations, such as
the normalised moments of random matrices, tend to be quite stable
in the large $n$ limit (as we have seen in previous sections), even as the
sample space and event space vary with $n$.

Secondly, this abstract formalism allows one to generalise the classical,
commutative theory of probability to the more general theory
of non-commutative probability, which does not have a classical
underlying sample space or event space, but is instead built upon a
(possibly) non-commutative algebra of random variables (or "observables")
and their expectations (or "traces"). This more general formalism
not only encompasses classical probability, but also spectral
theory (with matrices or operators taking the role of random variables,
and the trace taking the role of expectation), random matrix
theory (which can be viewed as a natural blend of classical probability
and spectral theory), and quantum mechanics (with physical observables
taking the role of random variables, and their expected value
on a given quantum state being the expectation). It is also part of
a more general "non-commutative way of thinking"$^{36}$ (of which
non-commutative geometry is the most prominent example), in which a
space is understood primarily in terms of the ring or algebra of functions
(or function-like objects, such as sections of bundles) placed
on top of that space, and then the space itself is largely abstracted
away in order to allow the algebraic structures to become less commutative.
In short, the idea is to make algebra the foundation of the
theory, as opposed to other possible choices of foundations such as
sets, measures, categories, etc.

$^{35}$ This theme of using abstraction to facilitate the taking of the large $n$ limit also
shows up in the application of ergodic theory to combinatorics via the correspondence
principle; see [Ta2009, §2.10] for further discussion.

$^{36}$ Note that this foundational preference is to some extent a metamathematical
one rather than a mathematical one; in many cases it is possible to rewrite the theory in
a mathematically equivalent form so that some other mathematical structure becomes
designated as the foundational one, much as probability theory can be equivalently
formulated as the measure theory of probability measures. However, this does not
negate the fact that a different choice of foundations can lead to a different way of
thinking about the subject, and thus to ask a different set of questions and to discover
a different set of proofs and solutions. Thus it is often of value to understand multiple
foundational perspectives at once, to get a truly stereoscopic view of the subject.

It turns out that non-commutative probability can be modeled
using operator algebras such as $C^*$-algebras, von Neumann algebras,
or algebras of bounded operators on a Hilbert space, with the latter
being accomplished via the Gelfand-Naimark-Segal construction. We
will discuss some of these models here, but just as probability theory
seeks to abstract away its measure-theoretic models, the philosophy
of non-commutative probability is also to downplay these operator
algebraic models once some foundational issues are settled.

When one generalises the set of structures in one's theory, for instance
from the commutative setting to the non-commutative setting,
the notion of what it means for a structure to be "universal", "free",
or "independent" can change. The most familiar example of this
comes from group theory. If one restricts attention to the category of
abelian groups, then the "freest" object one can generate from two
generators $e, f$ is the free abelian group of commutative words $e^n f^m$
with $n, m \in \mathbf{Z}$, which is isomorphic to the group $\mathbf{Z}^2$. If however one
generalises to the non-commutative setting of arbitrary groups, then
the "freest" object that can now be generated from two generators
$e, f$ is the free group $F_2$ of non-commutative words $e^{n_1}f^{m_1} \dots e^{n_k}f^{m_k}$
with $n_1, m_1, \dots, n_k, m_k \in \mathbf{Z}$, which is a significantly larger extension
of the free abelian group $\mathbf{Z}^2$.

Similarly, when generalising classical probability theory to
non-commutative probability theory, the notion of what it means for two
or more random variables to be independent changes. In the classical
(commutative) setting, two (bounded, real-valued) random variables
$X, Y$ are independent if one has

$$\mathbf{E}f(X)g(Y) = 0$$

whenever $f, g: \mathbf{R} \to \mathbf{R}$ are well-behaved functions (such as polynomials)
such that $\mathbf{E}f(X)$ and $\mathbf{E}g(Y)$ both vanish. In the non-commutative
setting, one can generalise the above definition to two commuting
bounded self-adjoint variables; this concept is useful for instance in
quantum probability, which is an abstraction of the theory of observables
in quantum mechanics. But for two (bounded, self-adjoint)
non-commutative random variables $X, Y$, the notion of classical independence
no longer applies. As a substitute, one can instead consider
the notion of being freely independent (or free for short), which means
that

$$\mathbf{E}f_1(X)g_1(Y) \dots f_k(X)g_k(Y) = 0$$

whenever $f_1, g_1, \dots, f_k, g_k: \mathbf{R} \to \mathbf{R}$ are well-behaved functions such
that all of $\mathbf{E}f_1(X), \mathbf{E}g_1(Y), \dots, \mathbf{E}f_k(X), \mathbf{E}g_k(Y)$ vanish.

The concept of free independence was introduced by Voiculescu, 
and its study is now known as the subject of free probability. We 
will not attempt a systematic survey of this subject here; for this, we 
refer the reader to the surveys of Speicher [Sp] and of Biane [Bi2003].
Instead, we shall just discuss a small number of topics in this area to 
give the flavour of the subject only. 

The significance of free probability to random matrix theory lies
in the fundamental observation that random matrices which have
independent entries in the classical sense also tend to be independent$^{37}$
in the free probability sense, in the large $n$ limit $n \to \infty$. Because
of this, many tedious computations in random matrix theory,
particularly those of an algebraic or enumerative combinatorial nature,
can be done more quickly and systematically by using the framework
of free probability, which by design is optimised for algebraic tasks
rather than analytical ones.

Much as free groups are in some sense "maximally non-commutative",
freely independent random variables are about as far from being commuting
as possible. For instance, if $X, Y$ are freely independent and
of expectation zero, then $\mathbf{E}XYXY$ vanishes, but $\mathbf{E}XXYY$ instead
factors as $(\mathbf{E}X^2)(\mathbf{E}Y^2)$. As a consequence, the behaviour of freely
independent random variables can be quite different from the behaviour
of their classically independent commuting counterparts. Nevertheless
there is a remarkably strong analogy between the two types of
independence, in that results which are true in the classically independent
case often have an interesting analogue in the freely independent
setting. For instance, the central limit theorem (Section 2.2) for
averages of classically independent random variables, which roughly
speaking asserts that such averages become Gaussian in the large $n$
limit, has an analogue for averages of freely independent variables,
the free central limit theorem, which roughly speaking asserts that
such averages become semicircular in the large $n$ limit. One can then
use this theorem to provide yet another proof of Wigner's semicircle
law (Section 2.4).

$^{37}$ This is only possible because of the highly non-commutative nature of these
matrices; as we shall see, it is not possible for non-trivial commuting independent
random variables to be freely independent.
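The contrast between $\mathbf{E}XYXY$ and $\mathbf{E}XXYY$ can already be seen on a pair of independent random matrices of moderate size. The following is a minimal sketch, assuming GUE-type matrices normalised so that the normalised trace of $X^2$ is about $1$, and using the normalised trace $\frac{1}{n}\operatorname{tr}$ of a single sample as a stand-in for the expectation; the names `gue_like` and `tau` are ours.

```python
import numpy as np

def gue_like(n, rng):
    """Hermitian matrix with independent entries, scaled so that tr(X^2)/n is about 1."""
    g = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (g + g.conj().T) / (2.0 * np.sqrt(n))

def tau(matrix):
    """Normalised trace (1/n) tr, playing the role of the expectation."""
    return np.trace(matrix).real / matrix.shape[0]

rng = np.random.default_rng(4)   # fixed seed, purely for reproducibility
n = 300
x = gue_like(n, rng)
y = gue_like(n, rng)             # independent of x; both have mean zero

alternating = tau(x @ y @ x @ y)             # analogue of E XYXY
grouped = tau(x @ x @ y @ y)                 # analogue of E XXYY
product_of_moments = tau(x @ x) * tau(y @ y) # analogue of (E X^2)(E Y^2)

print(f"tau(XYXY)           = {alternating:.4f}")
print(f"tau(XXYY)           = {grouped:.4f}")
print(f"tau(X^2) * tau(Y^2) = {product_of_moments:.4f}")
```

For large $n$ the first quantity should be close to zero and the last two close to each other, which is what asymptotic free independence predicts; for classically independent commuting scalars, by contrast, both $\mathbf{E}XYXY$ and $\mathbf{E}XXYY$ would equal $(\mathbf{E}X^2)(\mathbf{E}Y^2)$.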

Another important (and closely related) analogy is that while the 
distribution of sums of independent commutative random variables 
can be quickly computed via the characteristic function (i.e. the 
Fourier transform of the distribution), the distribution of sums of 
freely independent non-commutative random variables can be quickly 
computed using the Stieltjes transform instead (or with closely related 
objects, such as the R-transform of Voiculescu). This is strongly 
reminiscent of the appearance of the Stieltjes transform in random 
matrix theory, and indeed we will see many parallels between the use 
of the Stieltjes transform here and in Section 2.4. 

As mentioned earlier, free probability is an excellent tool for computing
various expressions of interest in random matrix theory, such
as asymptotic values of normalised moments in the large $n$ limit
$n \to \infty$. Nevertheless, as it only covers the asymptotic regime in
which $n$ is sent to infinity while holding all other parameters fixed,
there are some aspects of random matrix theory which the tools of
free probability are not sufficient by themselves to resolve (although
it can be possible to combine free probability theory with other tools
to then answer these questions). For instance, questions regarding
the rate of convergence of normalised moments as $n \to \infty$ are not
directly answered by free probability, though if free probability is combined
with tools such as concentration of measure (Section 2.1) then
such rate information can often be recovered. For similar reasons,
free probability lets one understand the behaviour of $k^{\text{th}}$ moments as
$n \to \infty$ for fixed $k$, but has more difficulty dealing with the situation
in which $k$ is allowed to grow slowly in $n$ (e.g. $k = O(\log n)$). Because
of this, free probability methods are effective at controlling the bulk
of the spectrum of a random matrix, but have more difficulty with
the edges of that spectrum (as well as with related concepts such as
the operator norm, see Section 2.3) as well as with fine-scale structure
of the spectrum. Finally, free probability methods are most effective
when dealing with matrices that are Hermitian with bounded operator
norm, largely because the spectral theory of bounded self-adjoint
operators in the infinite-dimensional setting of the large $n$ limit is
non-pathological$^{38}$. For non-self-adjoint operators, free probability needs
to be augmented with additional tools, most notably by bounds on
least singular values, in order to recover the required stability for the
various spectral data of random matrices to behave continuously with
respect to the large $n$ limit. We will return to this latter point in Section
2.7.

2.5.1. Abstract probability theory. We will now slowly build up 
the foundations of non-commutative probability theory, which seeks 
to capture the abstract algebra of random variables and their expec- 
tations. The impatient reader who wants to move directly on to free 
probability theory may largely jump straight to the final definition 
at the end of this section, but it can be instructive to work with 
these foundations for a while to gain some intuition on how to handle 
non-commutative probability spaces. 

To motivate the formalism of abstract (non-commutative) prob- 
ability theory, let us first discuss the three key examples of non- 
commutative probability spaces, and then abstract away all features 
that are not shared in common by all three examples. 

Example 2.5.1 (Random scalar variables). We begin with classical
probability theory - the study of scalar random variables. In order
to use the powerful tools of complex analysis (such as the Stieltjes
transform), it is very convenient to allow our random variables to
be complex valued. In order to meaningfully take expectations, we
would like to require all our random variables to also be absolutely
integrable. But this requirement is not sufficient by itself to get good
algebraic structure, because the product of two absolutely integrable
random variables need not be absolutely integrable. As we want to
have as much algebraic structure as possible, we will therefore restrict
attention further, to the collection $L^{\infty-} := \bigcap_{k=1}^{\infty} L^k(\Omega)$ of random
variables with all moments finite. This class is closed under multiplication,
and all elements in this class have a finite trace (or expectation).
One can of course restrict further, to the space $L^\infty = L^\infty(\Omega)$
of (essentially) bounded variables, but by doing so one loses important
examples of random variables, most notably Gaussians, so we
will work instead$^{39}$ with the space $L^{\infty-}$.

$^{38}$ This is ultimately due to the stable nature of eigenvalues in the self-adjoint
setting; see [Ta2010b, §1.5] for discussion.

The space $L^{\infty-}$ of complex-valued random variables with all
moments finite now becomes an algebra over the complex numbers
$\mathbf{C}$; i.e. it is a vector space over $\mathbf{C}$ that is also
equipped with a bilinear multiplication operation
$\cdot : L^{\infty-} \times L^{\infty-} \to L^{\infty-}$ that obeys the
associative and distributive laws. It is also commutative, but we will
suppress this property, as it is not shared by the other two examples
we will be discussing. The deterministic scalar 1 then plays the role
of the multiplicative unit in this algebra.

In addition to the usual algebraic operations, one can also take
the complex conjugate or adjoint $X^* = \overline{X}$ of a
complex-valued random variable X. This operation
$* : L^{\infty-} \to L^{\infty-}$ interacts well with the other
algebraic operations: it is in fact an anti-automorphism on
$L^{\infty-}$, which means that it preserves addition
($(X + Y)^* = X^* + Y^*$), reverses multiplication
($(XY)^* = Y^* X^*$), is anti-homogeneous
($(cX)^* = \overline{c} X^*$ for $c \in \mathbf{C}$), and is
invertible. In fact, it is its own inverse ($(X^*)^* = X$), and is
thus an involution.

This package of properties can be summarised succinctly by stating
that the space $L^{\infty-}$ of complex-valued random variables with
all moments finite is a (unital) *-algebra.

The expectation operator $\mathbf{E}$ can now be viewed as a map
$\mathbf{E} : L^{\infty-} \to \mathbf{C}$. It obeys some obvious
properties, such as being linear (i.e. $\mathbf{E}$ is a linear
functional on $L^{\infty-}$). In fact it is *-linear, which means that
it is linear and also that $\mathbf{E}(X^*) = \overline{\mathbf{E} X}$
for all X. We also clearly have $\mathbf{E} 1 = 1$. We will remark on
some additional properties of expectation later.

Example 2.5.2 (Deterministic matrix variables). A second key example
is that of (finite-dimensional) spectral theory - the theory of
$n \times n$ complex-valued matrices $X \in M_n(\mathbf{C})$. (One can
also consider infinite-dimensional spectral theory, of course, but for
simplicity we only consider the finite-dimensional case in order to
avoid having to deal with technicalities such as unbounded operators.)
Like the space $L^{\infty-}$ considered in the previous example,
$M_n(\mathbf{C})$ is a *-algebra, where the multiplication operation is
of course given by matrix multiplication, the identity is the matrix
identity $1 = I_n$, and the involution $X \mapsto X^*$ is given by the
matrix adjoint operation. On the other hand, as is well-known, this
*-algebra is not commutative (for $n \geq 2$).

$^{39}$ This will cost us some analytic structure - in particular,
$L^{\infty-}$ will not be a Banach space, in contrast to $L^\infty$ -
but as our focus is on the algebraic structure, this will be an
acceptable price to pay.

The analogue of the expectation operation here is the normalised
trace $\tau(X) := \frac{1}{n} \operatorname{tr} X$. Thus
$\tau : M_n(\mathbf{C}) \to \mathbf{C}$ is a *-linear functional
on $M_n(\mathbf{C})$ that maps 1 to 1. The analogy between expectation
and normalised trace is particularly evident when comparing the moment
method for scalar random variables (based on computation of the
moments $\mathbf{E} X^k$) with the moment method in spectral theory
(based on a computation of the moments $\tau(X^k)$).
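The parallel between the two moment methods can be checked numerically on a small case. The following Python sketch compares the normalised-trace moments of a matrix with the averages of the powers of its eigenvalues. The particular 2 x 2 symmetric matrix and the helper names (matmul, trace_moment, spectral_moment) are illustrative assumptions, not taken from the text.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def normalised_trace(a):
    """tau(X) = (1/n) tr X."""
    return sum(a[i][i] for i in range(len(a))) / len(a)

def trace_moment(a, k):
    """tau(X^k), computed by repeated matrix multiplication."""
    n = len(a)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        power = matmul(power, a)
    return normalised_trace(power)

# Hypothetical test matrix: real symmetric 2 x 2, X = [[a, b], [b, d]].
a, b, d = 2.0, 1.0, -1.0
X = [[a, b], [b, d]]

# Eigenvalues of a symmetric 2 x 2 matrix from the quadratic formula.
mean = (a + d) / 2
gap = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
eigenvalues = [mean + gap, mean - gap]

def spectral_moment(k):
    """Average of lambda^k over the eigenvalues of X."""
    return sum(lam ** k for lam in eigenvalues) / len(eigenvalues)

# The two columns should agree up to rounding error.
for k in range(6):
    print(k, trace_moment(X, k), spectral_moment(k))
```

The first column plays the role of the moments computed algebraically from the matrix, and the second the role of the moments of a probability distribution placing equal mass on each eigenvalue.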

Example 2.5.3 (Random matrix variables). Random matrix theory
combines classical probability theory with finite-dimensional spectral
theory, with the random variables of interest now being the random
matrices $X \in L^{\infty-} \otimes M_n(\mathbf{C})$, all of whose
entries have all moments finite. It is not hard to see that this is
also a *-algebra with identity $1 = I_n$, which again will be
non-commutative for $n \geq 2$. The normalised trace $\tau$ here is
given by

$\tau(X) := \mathbf{E} \frac{1}{n} \operatorname{tr} X$,

thus one takes both the normalised matrix trace and the probabilistic
expectation, in order to arrive at a deterministic scalar (i.e. a complex
number). As before, we see that
$\tau : L^{\infty-} \otimes M_n(\mathbf{C}) \to \mathbf{C}$ is a
*-linear functional that maps 1 to 1. As we saw in Section 2.3, the
moment method for random matrices is based on a computation of
the moments $\tau(X^k) = \mathbf{E} \frac{1}{n} \operatorname{tr} X^k$.

Let us now simultaneously abstract the above three examples, 
but reserving the right to impose some additional axioms as needed: 

Definition 2.5.4 (Non-commutative probability space, preliminary
definition). A non-commutative probability space (or more accurately,
a potentially non-commutative probability space) $(\mathcal{A}, \tau)$
will consist of a (potentially non-commutative) *-algebra
$\mathcal{A}$ of (potentially non-commutative) random variables (or
observables) with identity 1, together with a trace
$\tau : \mathcal{A} \to \mathbf{C}$, which is a *-linear functional that
maps 1 to 1. This trace will be required to obey a number of
additional axioms which we will specify later in this section.

This definition is not yet complete, because we have not fully
decided on what axioms to enforce for these spaces, but for now
let us just say that the three examples $(L^{\infty-}, \mathbf{E})$,
$(M_n(\mathbf{C}), \frac{1}{n} \operatorname{tr})$,
$(L^{\infty-} \otimes M_n(\mathbf{C}), \mathbf{E} \frac{1}{n} \operatorname{tr})$
given above will obey these axioms and serve as model examples of
non-commutative probability spaces. We mention that the requirement
$\tau(1) = 1$ can be viewed as an abstraction of Kolmogorov's axiom
that the sample space has probability 1.

To motivate the remaining axioms, let us try seeing how some 
basic concepts from the model examples carry over to the abstract 
setting. 

Firstly, we recall that every scalar random variable
$X \in L^{\infty-}$ has a probability distribution $\mu_X$, which is a
probability measure on the complex plane $\mathbf{C}$; if X is
self-adjoint (i.e. real valued), so that $X = X^*$, then this
distribution is supported on the real line $\mathbf{R}$. The condition
that X lie in $L^{\infty-}$ ensures that this measure is rapidly
decreasing, in the sense that
$\int_{\mathbf{C}} |z|^k \, d\mu_X(z) < \infty$ for all k. The
measure $\mu_X$ is related to the moments
$\tau(X^k) = \mathbf{E} X^k$ by the formula

(2.112) $\tau(X^k) = \int_{\mathbf{C}} z^k \, d\mu_X(z)$

for $k = 0, 1, 2, \ldots$. In fact, one has the more general formula

(2.113) $\tau(X^k (X^*)^l) = \int_{\mathbf{C}} z^k \overline{z}^l \, d\mu_X(z)$

for $k, l = 0, 1, 2, \ldots$.

Similarly, every deterministic matrix $X \in M_n(\mathbf{C})$ has an
empirical spectral distribution
$\mu_X := \frac{1}{n} \sum_{i=1}^n \delta_{\lambda_i(X)}$, which is a
probability measure on the complex plane $\mathbf{C}$. Again, if X is
self-adjoint, then this distribution is supported on the real line
$\mathbf{R}$. This measure is related to the moments
$\tau(X^k) = \frac{1}{n} \operatorname{tr} X^k$ by the same formula
(2.112) as in the case of scalar random variables. Because n is finite,
this measure is finitely supported (and in particular is rapidly
decreasing). As for (2.113), the spectral theorem tells us that this
formula holds when X is normal (i.e. $XX^* = X^*X$), and in particular
if X is self-adjoint (of course, in this case (2.113) collapses to
(2.112)), but is not true in general. Note that this subtlety does not
appear in the case of scalar random variables because in this
commutative setting, all elements are automatically normal.
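To see concretely how (2.113) can fail for a non-normal matrix, one can work through the smallest possible case. The 2 x 2 nilpotent matrix below is an illustrative choice (it is not taken from the text); both of its eigenvalues vanish, so its empirical spectral distribution is the Dirac mass at the origin, and (2.112) holds while (2.113) with k = l = 1 does not.

```latex
\begin{align*}
X &= \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \qquad
X^* = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}, \qquad
X X^* = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad
\mu_X = \delta_0, \\
\tau(X^k) &= 0 = \int_{\mathbf{C}} z^k \, d\mu_X(z) \quad (k \geq 1), \\
\tau(X X^*) &= \tfrac{1}{2} \operatorname{tr}(X X^*) = \tfrac{1}{2}
\;\neq\; 0 = \int_{\mathbf{C}} z \overline{z} \, d\mu_X(z).
\end{align*}
```

Here $X^* X \neq X X^*$, so X is not normal, which is consistent with the spectral theorem giving no guarantee in this case.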

Finally, for random matrices
$X \in L^{\infty-} \otimes M_n(\mathbf{C})$, we can form the expected
empirical spectral distribution
$\mu_X := \mathbf{E} \frac{1}{n} \sum_{i=1}^n \delta_{\lambda_i(X)}$,
which is again a rapidly decreasing probability measure on
$\mathbf{C}$, which is supported on $\mathbf{R}$ if X is self-adjoint.
This measure is again related to the moments
$\tau(X^k) = \mathbf{E} \frac{1}{n} \operatorname{tr} X^k$ by the
formula (2.112), and also by (2.113) if X is normal.

Now let us see whether we can set up such a spectral measure $\mu_X$
for an element X in an abstract non-commutative probability space
$(\mathcal{A}, \tau)$. From the above examples, it is natural to try to
define this measure through the formula (2.112), or equivalently (by
linearity) through the formula

(2.114) $\tau(P(X)) = \int_{\mathbf{C}} P(z) \, d\mu_X(z)$

whenever $P : \mathbf{C} \to \mathbf{C}$ is a polynomial with complex
coefficients (note that one can define P(X) without difficulty as
$\mathcal{A}$ is a *-algebra). In the normal case, one may hope to work
with the more general formula

(2.115) $\tau(P(X, X^*)) = \int_{\mathbf{C}} P(z, \overline{z}) \, d\mu_X(z)$

whenever $P : \mathbf{C} \times \mathbf{C} \to \mathbf{C}$ is a
polynomial of two complex variables (note that $P(X, X^*)$ can be
defined unambiguously precisely when X is normal).

It is tempting to apply the Riesz representation theorem to (2.114)
to define the desired measure $\mu_X$, perhaps after first using the
Weierstrass approximation theorem to pass from polynomials to
continuous functions. However, there are multiple technical issues with
this idea:

(i) In order for the polynomials to be dense in the continuous
functions in the uniform topology on the support of $\mu_X$, one
needs the intended support $\sigma(X)$ of $\mu_X$ to be on the real
line $\mathbf{R}$, or else one needs to work with the formula (2.115)
rather than (2.114). Also, one needs the intended support $\sigma(X)$
to be bounded for the Weierstrass approximation theorem to apply
directly.

(ii) In order for the Riesz representation theorem to apply, the
functional $P \mapsto \tau(P(X, X^*))$ (or $P \mapsto \tau(P(X))$)
needs to be continuous in the uniform topology, thus one must
be able to obtain a bound$^{40}$ of the form
$|\tau(P(X, X^*))| \leq C \sup_{z \in \sigma(X)} |P(z, \overline{z})|$
for some (preferably compact) set $\sigma(X)$.

(iii) In order to get a probability measure rather than a signed
measure, one also needs some non-negativity: $\tau(P(X, X^*))$
needs to be non-negative whenever $P(z, \overline{z}) \geq 0$ for z in
the intended support $\sigma(X)$.

To resolve the non-negativity issue, we impose an additional axiom
on the non-commutative probability space $(\mathcal{A}, \tau)$:

Axiom 2.5.5 (Non-negativity). For any $X \in \mathcal{A}$, we have
$\tau(X^* X) \geq 0$. (Note that $X^* X$ is self-adjoint and so its
trace $\tau(X^* X)$ is necessarily a real number.)

In the language of von Neumann algebras, this axiom (together
with the normalisation $\tau(1) = 1$) is essentially asserting that
$\tau$ is a state. Note that this axiom is obeyed by all three model
examples, and is also consistent with (2.115). It is the
non-commutative analogue of the Kolmogorov axiom that all events have
non-negative probability.

With this axiom, we can now define a positive semi-definite
inner product $\langle \cdot, \cdot \rangle_{L^2(\tau)}$ on
$\mathcal{A}$ by the formula

$\langle X, Y \rangle_{L^2(\tau)} := \tau(X^* Y)$.

This obeys the usual axioms of an inner product, except that it is only
positive semi-definite rather than positive definite. One can impose
positive definiteness by adding an axiom that the trace $\tau$ is
faithful, which means that $\tau(X^* X) = 0$ if and only if X = 0.
However, we will not need the faithfulness axiom here.

$^{40}$ To get a probability measure, one in fact needs to have $C = 1$.

Without faithfulness, $\mathcal{A}$ is a semi-definite inner product
space with semi-norm

$\|X\|_{L^2(\tau)} := \langle X, X \rangle_{L^2(\tau)}^{1/2} = \tau(X^* X)^{1/2}$.

In particular, we have the Cauchy-Schwarz inequality

$|\langle X, Y \rangle_{L^2(\tau)}| \leq \|X\|_{L^2(\tau)} \|Y\|_{L^2(\tau)}$.

This leads to an important monotonicity: 

Exercise 2.5.1 (Monotonicity). Let X be a self-adjoint element of
a non-commutative probability space $(\mathcal{A}, \tau)$. Show that we
have the monotonicity relationships

$|\tau(X^{2k-1})|^{1/(2k-1)} \leq |\tau(X^{2k})|^{1/(2k)} \leq |\tau(X^{2k+2})|^{1/(2k+2)}$

for any integer $k \geq 1$.

As a consequence, we can define the spectral radius $\rho(X)$ of a
self-adjoint element X by the formula

(2.116) $\rho(X) := \lim_{k \to \infty} |\tau(X^{2k})|^{1/2k}$,

in which case we obtain the inequality

(2.117) $|\tau(X^k)| \leq \rho(X)^k$

for any $k = 0, 1, 2, \ldots$. We then say that a self-adjoint element
is bounded if its spectral radius is finite.
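The monotone convergence behind (2.116) can be observed numerically in the matrix model. The sketch below computes the quantities |tau(X^{2k})|^{1/2k} for a small symmetric matrix and compares them with the largest eigenvalue in absolute value. The test matrix and the function names are assumptions made for illustration only.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def normalised_trace(a):
    """tau(X) = (1/n) tr X."""
    return sum(a[i][i] for i in range(len(a))) / len(a)

def even_moment_roots(x, k_max):
    """Return the list of tau(X^{2k})^{1/(2k)} for k = 1, ..., k_max."""
    n = len(x)
    square = matmul(x, x)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    roots = []
    for k in range(1, k_max + 1):
        power = matmul(power, square)  # power is now X^{2k}
        roots.append(normalised_trace(power) ** (1.0 / (2 * k)))
    return roots

# Hypothetical self-adjoint test matrix with eigenvalues (1 +- sqrt(13)) / 2.
X = [[2.0, 1.0], [1.0, -1.0]]
rho = (1 + math.sqrt(13)) / 2  # largest eigenvalue in absolute value

roots = even_moment_roots(X, 40)
for k in (1, 2, 5, 10, 20, 40):
    print(k, roots[k - 1])
print("rho(X) =", rho)
```

The printed values increase with k and stay below rho(X), in line with Exercise 2.5.1 and (2.117); the convergence is slow, since the normalisation by 1/n only disappears in the limit.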

Example 2.5.6. In the case of random variables, the spectral radius
is the essential supremum $\|X\|_{L^\infty}$, while for deterministic
matrices, the spectral radius is the operator norm $\|X\|_{op}$. For
random matrices, the spectral radius is the essential supremum
$\| \|X\|_{op} \|_{L^\infty}$ of the operator norm.

Guided by the model examples, we expect that a bounded self-adjoint
element X should have a spectral measure $\mu_X$ supported on
the interval $[-\rho(X), \rho(X)]$. But how to show this? It turns out
that one can proceed by tapping the power of complex analysis, and
introducing the Stieltjes transform

(2.118) $s_X(z) := \tau((X - z)^{-1})$

for complex numbers z. Now, this transform need not be defined for
all z at present, because we do not know that $X - z$ is invertible in
$\mathcal{A}$. However, we can avoid this problem by working formally.
Indeed, we have the formal Neumann series expansion

$(X - z)^{-1} = -\frac{1}{z} - \frac{X}{z^2} - \frac{X^2}{z^3} - \ldots$

which leads to the formal Laurent series expansion

(2.119) $s_X(z) = -\sum_{k=0}^\infty \frac{\tau(X^k)}{z^{k+1}}$.

If X is bounded self-adjoint, then from (2.117) we see that this formal
series actually converges in the region $|z| > \rho(X)$. We will thus
define the Stieltjes transform $s_X(z)$ on the region $|z| > \rho(X)$
by this series expansion (2.119), and then extend to as much of the
complex plane as we can by analytic continuation$^{41}$.

$^{41}$ There could in principle be some topological obstructions to this
continuation, but we will soon see that the only place where
singularities can occur is on the real interval
$[-\rho(X), \rho(X)]$, and so no topological obstructions will appear.
One can also work with the original definition (2.118) of the Stieltjes
transform, but this requires imposing some additional analytic axioms
on the non-commutative probability space, such as requiring that
$\mathcal{A}$ be a C*-algebra or a von Neumann algebra, and we will
avoid discussing these topics here as they are not the main focus of
free probability theory.

We now push the domain of definition of $s_X(z)$ into the disk
$\{|z| \leq \rho(X)\}$. We need some preliminary lemmas.

Exercise 2.5.2. Let X be bounded self-adjoint. For any real number
R, show that $\rho(R^2 + X^2) = R^2 + \rho(X)^2$. (Hint: use (2.116),
(2.117)).

Exercise 2.5.3. Let X be bounded normal. Show that

$|\tau(X^k)| \leq \tau((X^* X)^k)^{1/2} \leq \rho(X^* X)^{k/2}$.

Now let R be a large positive real number. The idea is to rewrite
the (formal) Stieltjes transform $\tau((X - z)^{-1})$ using the formal
identity

(2.120) $(X - z)^{-1} = ((X + iR) - (z + iR))^{-1}$

and take Neumann series again to arrive at the formal expansion

(2.121) $s_X(z) = -\sum_{k=0}^\infty \frac{\tau((X + iR)^k)}{(z + iR)^{k+1}}$.

From the previous two exercises we see that

$|\tau((X + iR)^k)| \leq (R^2 + \rho(X)^2)^{k/2}$

and so the above Laurent series converges for
$|z + iR| > (R^2 + \rho(X)^2)^{1/2}$.

Exercise 2.5.4. Give a rigorous proof that the two series (2.119), 
(2.121) agree for z large enough. 

We have thus extended $s_X(z)$ analytically to the region
$\{z : |z + iR| > (R^2 + \rho(X)^2)^{1/2}\}$. Letting $R \to \infty$,
we obtain an extension of $s_X(z)$ to the upper half-plane
$\{z : \operatorname{Im}(z) > 0\}$. A similar argument (shifting by
$-iR$ instead of $+iR$) gives an extension to the lower half-plane,
thus defining $s_X(z)$ analytically everywhere except on the
interval $[-\rho(X), \rho(X)]$.
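The effect of the shift by iR can be seen in a small matrix example. The following sketch sums the two Laurent series (2.119) and (2.121) for a 2 x 2 matrix with eigenvalues +1 and -1, for which the Stieltjes transform has the closed form z / (1 - z^2). The matrix, the evaluation points, the number of terms and all function names are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def normalised_trace(a):
    """tau(X) = (1/n) tr X."""
    return sum(a[i][i] for i in range(len(a))) / len(a)

def stieltjes_series(x, z, big_r=0.0, terms=400):
    """Partial sum of -sum_k tau((X + iR)^k) / (z + iR)^(k+1).

    big_r = 0 gives the series (2.119); big_r > 0 gives (2.121).
    """
    n = len(x)
    w = z + 1j * big_r
    # Work with M = (X + iR) / w so that the powers stay bounded:
    # the series equals -(1/w) * sum_k tau(M^k).
    m = [[(x[i][j] + (1j * big_r if i == j else 0.0)) / w for j in range(n)]
         for i in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = 0.0
    for _ in range(terms):
        total += normalised_trace(power)
        power = matmul(power, m)
    return -total / w

def exact_stieltjes(z):
    """Closed form tau((X - z)^{-1}) = z / (1 - z^2) for the X below."""
    return z / (1 - z * z)

X = [[0.0, 1.0], [1.0, 0.0]]  # hypothetical example; eigenvalues +1, -1

z_far = 3.0    # |z| > rho(X) = 1: both series converge
z_near = 0.5j  # |z| < rho(X): only the shifted series converges
print(stieltjes_series(X, z_far), exact_stieltjes(z_far))
print(stieltjes_series(X, z_far, big_r=2.0), exact_stieltjes(z_far))
print(stieltjes_series(X, z_near, big_r=2.0), exact_stieltjes(z_near))
```

The point z = 0.5i lies inside the disk of radius rho(X) = 1, where the unshifted series diverges, but with R = 2 it satisfies |z + iR| = 2.5 > (R^2 + rho(X)^2)^{1/2}, so the shifted series still converges there.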

On the other hand, it is not possible to analytically extend $s_X(z)$
to the region $\{z : |z| > \rho(X) - \varepsilon\}$ for any
$0 < \varepsilon < \rho(X)$. Indeed, if this were the case, then from
the Cauchy integral formula (applied at infinity), we would have the
identity

$\tau(X^{2k}) = -\frac{1}{2\pi i} \oint_{|z| = R} s_X(z) z^{2k} \, dz$

(with the circle traversed counterclockwise) for any
$R > \rho(X) - \varepsilon$, which when combined with (2.116) implies
that $\rho(X) \leq R$ for all such R, which is absurd. Thus the spectral
radius $\rho(X)$ can also be interpreted as the radius of the smallest
ball centred at the origin outside of which the Stieltjes transform can
be analytically continued.

Now that we have the Stieltjes transform everywhere outside of
$[-\rho(X), \rho(X)]$, we can use it to derive an important bound
(which will soon be superseded by (2.114), but will play a key role in
the proof of that stronger statement):

Proposition 2.5.7 (Boundedness). Let X be bounded self-adjoint,
and let $P : \mathbf{C} \to \mathbf{C}$ be a polynomial. Then

$|\tau(P(X))| \leq \sup_{x \in [-\rho(X), \rho(X)]} |P(x)|$.

Proof. (Sketch) We can of course assume that P is non-constant, as
the claim is obvious otherwise. From Exercise 2.5.3 (replacing P with
$P \overline{P}$, where $\overline{P}$ is the polynomial whose
coefficients are the complex conjugates of those of P) we may reduce to
the case when P has real coefficients, so that P(X) is self-adjoint.
Since X is bounded, it is not difficult (using (2.116), (2.117)) to
show that P(X) is bounded also (Exercise!).

As P(X) is bounded self-adjoint, it has a Stieltjes transform
defined outside of $[-\rho(P(X)), \rho(P(X))]$, which for large z is
given by the formula

(2.122) $s_{P(X)}(z) = -\sum_{k=0}^\infty \frac{\tau(P(X)^k)}{z^{k+1}}$.

By the previous discussion, to establish the proposition it will suffice
to show that the Stieltjes transform can be continued to the domain

$\Omega := \{z \in \mathbf{C} : |z| > \sup_{x \in [-\rho(X), \rho(X)]} |P(x)|\}$.

For this, we observe the partial fractions decomposition

$\frac{1}{P(w) - z} = \sum_{\zeta : P(\zeta) = z} \frac{P'(\zeta)^{-1}}{w - \zeta}$

of $(P(w) - z)^{-1}$ into linear combinations of $(w - \zeta)^{-1}$, at
least when the roots of $P - z$ are simple. Thus, formally, at least,
we have the identity

$s_{P(X)}(z) = \sum_{\zeta : P(\zeta) = z} \frac{1}{P'(\zeta)} s_X(\zeta)$.

One can verify this identity is consistent with (2.122) for z sufficiently
large. (Exercise! Hint: First do the case when X is a scalar, then
expand in Taylor series and compare coefficients, then use the
agreement of the Taylor series to do the general case.)
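As an illustration of the partial fractions step, consider the simplest non-trivial case $P(w) = w^2$ (an example chosen here for concreteness, not taken from the text). The roots of $P(\zeta) = z$ are $\zeta = \pm\sqrt{z}$, with $P'(\zeta) = 2\zeta$, and the formal identity can be checked directly against the Laurent expansion of the Stieltjes transform of X:

```latex
\begin{align*}
\frac{1}{w^2 - z}
  &= \frac{1}{2\sqrt{z}} \left( \frac{1}{w - \sqrt{z}} - \frac{1}{w + \sqrt{z}} \right), \\
s_{X^2}(z)
  &= \frac{1}{2\sqrt{z}} \left( s_X(\sqrt{z}) - s_X(-\sqrt{z}) \right) \\
  &= -\frac{1}{2\sqrt{z}} \sum_{k=0}^\infty
     \tau(X^k) \, \frac{1 - (-1)^{k+1}}{(\sqrt{z})^{k+1}}
   = -\sum_{j=0}^\infty \frac{\tau(X^{2j})}{z^{j+1}} .
\end{align*}
```

Only the even powers of X survive the cancellation, and the final series is exactly the Laurent expansion of the Stieltjes transform of $X^2$, as expected.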

If z is in the domain $\Omega$, then all the roots $\zeta$ of
$P(\zeta) - z$ lie outside the interval $[-\rho(X), \rho(X)]$. So we
can use the above formula as a definition of $s_{P(X)}(z)$, at least
for those $z \in \Omega$ for which the roots of $P - z$ are simple; but
there are only finitely many exceptional z (arising from zeroes of
$P'$) and one can check (Exercise! Hint: use the analytic nature of
$s_X$ and the residue theorem to rewrite parts of $s_{P(X)}(z)$ as a
contour integral.) that the singularities here are removable. It is
easy to see (Exercise!) that $s_{P(X)}$ is holomorphic outside of these
removable singularities, and the claim follows. □

Exercise 2.5.5. Fill in the steps marked (Exercise!) in the above 
proof. 

From Proposition 2.5.7 and the Weierstrass approximation theorem
(see e.g. [Ta2010, §1.10]), we see that the linear functional
$P \mapsto \tau(P(X))$ can be uniquely extended to a bounded linear
functional on $C([-\rho(X), \rho(X)])$, with operator norm 1.
Applying the Riesz representation theorem (see e.g. [Ta2010, §1.10]),
we thus can find a unique Radon measure (or equivalently, Borel
measure) $\mu_X$ on $[-\rho(X), \rho(X)]$ of total variation 1 obeying
the identity (2.114) for all P. In particular, setting P = 1 we see
that $\mu_X$ has total mass 1; since it also has total variation 1, it
must be a probability measure. We have thus shown the fundamental

Theorem 2.5.8 (Spectral theorem for bounded self-adjoint elements).
Let X be a bounded self-adjoint element of a non-commutative
probability space $(\mathcal{A}, \tau)$. Then there exists a unique
Borel probability measure $\mu_X$ on $[-\rho(X), \rho(X)]$ (known as
the spectral measure of X) such that (2.114) holds for all polynomials
$P : \mathbf{C} \to \mathbf{C}$.

Remark 2.5.9. If one assumes some completeness properties of the
non-commutative probability space, such as that $\mathcal{A}$ is a
C*-algebra or a von Neumann algebra, one can use this theorem to
meaningfully define F(X) for other functions
$F : [-\rho(X), \rho(X)] \to \mathbf{C}$ than polynomials;
specifically, one can do this for continuous functions F if
$\mathcal{A}$ is a C*-algebra, and for $L^\infty(\mu_X)$ functions F if
$\mathcal{A}$ is a von Neumann algebra. Thus for instance we can start
to define absolute values $|X|$, or square roots $|X|^{1/2}$, etc.
Such an assignment $F \mapsto F(X)$ is known as a functional calculus;
it can be used for instance to go back and make rigorous sense of the
formula (2.118). A functional calculus is a very convenient tool to
have in operator algebra theory, and for that reason one often
completes a non-commutative probability space into a C*-algebra or von
Neumann algebra, much as how it is often convenient to complete the
rationals and work instead with the reals. However, we will proceed
here instead by working with a (possibly incomplete) non-commutative
probability space, and working primarily with formal expressions (e.g.
formal power series in z) without trying to evaluate such expressions
in some completed space. We can get away with this because we will be
working exclusively in situations in which the spectrum of a random
variable can be reconstructed exactly from its moments (which is in
particular true in the case of bounded random variables). For unbounded
random variables, one must usually instead use the full power of
functional analysis, and work with the spectral theory of unbounded
operators on Hilbert spaces.

Exercise 2.5.6. Let X be a bounded self-adjoint element of a
non-commutative probability space, and let $\mu_X$ be the spectral
measure of X. Establish the formula

$s_X(z) = \int_{[-\rho(X), \rho(X)]} \frac{1}{x - z} \, d\mu_X(x)$

for all $z \in \mathbf{C} \setminus [-\rho(X), \rho(X)]$. Conclude that
the support$^{42}$ of the spectral measure $\mu_X$ must contain at
least one of the two points $-\rho(X), \rho(X)$.

$^{42}$ The support of a measure is the intersection of all the closed sets
of full measure.

Exercise 2.5.7. Let X be a bounded self-adjoint element of a
non-commutative probability space with faithful trace. Show that
$\rho(X) = 0$ if and only if X = 0.

Remark 2.5.10. It is possible to also obtain a spectral theorem for
bounded normal elements along the lines of the above theorem (with
$\mu_X$ now supported in a disk rather than in an interval, and with
(2.114) replaced by (2.115)), but this is somewhat more complicated
to show (basically, one needs to extend the self-adjoint spectral
theorem to a pair of commuting self-adjoint elements, which is a little
tricky to show by complex-analytic methods, as one has to use several
complex variables).

The spectral theorem more or less completely describes the behaviour
of a single (bounded self-adjoint) element X in a non-commutative
probability space. As remarked above, it can also be extended to
study multiple commuting self-adjoint elements. However, when one
deals with multiple non-commuting elements, the spectral theorem
becomes inadequate (and indeed, it appears that in general there is
no usable substitute for this theorem). However, we can begin making
a little bit of headway if we assume as a final (optional) axiom a
very weak form of commutativity in the trace:

Axiom 2.5.11 (Trace). For any two elements X, Y, we have
$\tau(XY) = \tau(YX)$.

Note that this axiom is obeyed by all three of our model examples.
From this axiom, we can cyclically permute products in a trace, e.g.
$\tau(XYZ) = \tau(YZX) = \tau(ZXY)$. However, we cannot take
non-cyclic permutations; for instance, $\tau(XYZ)$ and $\tau(XZY)$ are
distinct in general. This axiom is a trivial consequence of the
commutative nature of the complex numbers in the classical setting, but
can play a more non-trivial role in the non-commutative setting. It is
however possible to develop a large part of free probability without
this axiom, if one is willing instead to work in the category of von
Neumann algebras. Thus, we shall leave it as an optional axiom:

Definition 2.5.12 (Non-commutative probability space, final
definition). A non-commutative probability space $(\mathcal{A}, \tau)$
consists of a *-algebra $\mathcal{A}$ with identity 1, together with a
*-linear functional $\tau : \mathcal{A} \to \mathbf{C}$ that maps 1 to
1 and obeys the non-negativity axiom. If $\tau$ obeys the trace axiom,
we say that the non-commutative probability space is tracial. If
$\tau$ obeys the faithfulness axiom, we say that the non-commutative
probability space is faithful.
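The difference between cyclic and non-cyclic permutations under the trace axiom can be checked by hand in the matrix model. The sketch below does so in exact rational arithmetic for three 2 x 2 matrices; the matrices and the helper names are illustrative assumptions.

```python
from fractions import Fraction

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def tau(a):
    """Normalised trace (1/n) tr X, as an exact fraction."""
    n = len(a)
    return Fraction(sum(a[i][i] for i in range(n)), n)

def tau_of_product(*factors):
    """tau of the ordered product of the given matrices."""
    product = factors[0]
    for f in factors[1:]:
        product = matmul(product, f)
    return tau(product)

# Hypothetical test matrices (matrix units in M_2).
X = [[0, 1], [0, 0]]
Y = [[0, 0], [1, 0]]
Z = [[1, 0], [0, 0]]

# Cyclic permutations give the same value.
print(tau_of_product(X, Y, Z), tau_of_product(Y, Z, X), tau_of_product(Z, X, Y))
# A non-cyclic permutation gives a different value.
print(tau_of_product(X, Z, Y))
```

Here the three cyclic rearrangements all have trace 1/2, while the non-cyclic rearrangement XZY has trace 0, because XZ is already the zero matrix.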

From this new axiom and the Cauchy-Schwarz inequality we can 
now get control on products of several non-commuting elements: 

Exercise 2.5.8. Let $X_1, \ldots, X_k$ be bounded self-adjoint elements
of a tracial non-commutative probability space $(\mathcal{A}, \tau)$.
Show that

$|\tau(X_1^{m_1} \cdots X_k^{m_k})| \leq \rho(X_1)^{m_1} \cdots \rho(X_k)^{m_k}$

for any non-negative integers $m_1, \ldots, m_k$. (Hint: Induct on k,
and use Cauchy-Schwarz to split up the product as evenly as possible,
using cyclic permutations to reduce the complexity of the resulting
expressions.)

Exercise 2.5.9. Let $\mathcal{A} \cap L^\infty(\tau)$ be those elements
X in a tracial non-commutative probability space $(\mathcal{A}, \tau)$
whose real and imaginary parts
$\operatorname{Re}(X) := \frac{X + X^*}{2}$,
$\operatorname{Im}(X) := \frac{X - X^*}{2i}$ are bounded and
self-adjoint; we refer to such elements simply as bounded elements.
Show that this is a sub-*-algebra of $\mathcal{A}$.

This allows one to perform the following Gelfand-Naimark-Segal
(GNS) construction. Recall that $\mathcal{A} \cap L^\infty(\tau)$ has a
positive semi-definite inner product
$\langle \cdot, \cdot \rangle_{L^2(\tau)}$. We can perform the Hilbert
space completion of this inner product space (quotienting out by the
elements of zero norm), leading to a complex Hilbert space
$L^2(\tau)$ into which $\mathcal{A} \cap L^\infty(\tau)$ can be mapped
as a dense subspace by an isometry$^{43}$
$\iota : \mathcal{A} \cap L^\infty(\tau) \to L^2(\tau)$.

The space $\mathcal{A} \cap L^\infty(\tau)$ acts on itself by
multiplication, and thus also acts on the dense subspace
$\iota(\mathcal{A} \cap L^\infty(\tau))$ of $L^2(\tau)$. We would like
to extend this action to all of $L^2(\tau)$, but this requires an
additional estimate:

Lemma 2.5.13. Let $(\mathcal{A}, \tau)$ be a tracial non-commutative
probability space. If $X, Y \in \mathcal{A} \cap L^\infty(\tau)$ with X
self-adjoint, then

$\|XY\|_{L^2(\tau)} \leq \rho(X) \|Y\|_{L^2(\tau)}$.

Proof. Squaring and cyclically permuting, it will suffice to show that

$\tau(Y^* X^2 Y) \leq \rho(X)^2 \tau(Y^* Y)$.

Let $\varepsilon > 0$ be arbitrary. By Weierstrass approximation, we
can find a polynomial P with real coefficients such that
$x^2 + P(x)^2 = \rho(X)^2 + O(\varepsilon)$ on the interval
$[-\rho(X), \rho(X)]$. By Proposition 2.5.7, we can thus write
$X^2 + P(X)^2 = \rho(X)^2 + E$ where E is self-adjoint with
$\rho(E) = O(\varepsilon)$. Multiplying on the left by $Y^*$ and on the
right by Y and taking traces, we obtain

$\tau(Y^* X^2 Y) + \tau(Y^* P(X)^2 Y) \leq \rho(X)^2 \tau(Y^* Y) + \tau(Y^* E Y)$.

By non-negativity, $\tau(Y^* P(X)^2 Y) \geq 0$. By Exercise 2.5.8, we
have $\tau(Y^* E Y) = O_Y(\varepsilon)$. Sending $\varepsilon \to 0$ we
obtain the claim. □

As a consequence, we see that the self-adjoint elements X of
$\mathcal{A} \cap L^\infty(\tau)$ act in a bounded manner on all of
$L^2(\tau)$, and so on taking real and imaginary parts, we see that the
same is true for the non-self-adjoint elements too. Thus we can
associate to each $X \in L^\infty(\tau)$ a bounded linear
transformation $\overline{X} \in B(L^2(\tau))$ on the Hilbert space
$L^2(\tau)$.

$^{43}$ This isometry is injective when $\mathcal{A}$ is faithful, but will
have a non-trivial kernel otherwise.

Exercise 2.5.10 (Gelfand-Naimark theorem). Show that the map
$X \mapsto \overline{X}$ is a *-isomorphism from
$\mathcal{A} \cap L^\infty(\tau)$ to a *-subalgebra of
$B(L^2(\tau))$, and that one has the representation

$\tau(X) = \langle e, \overline{X} e \rangle$

for any $X \in L^\infty(\tau)$, where e is the unit vector
$e := \iota(1)$.

Remark 2.5.14. The Gelfand-Naimark theorem required the tracial
hypothesis only to deal with the error E in the proof of Lemma 2.5.13.
One can also establish this theorem without this hypothesis, by
assuming instead that the non-commutative space is a C*-algebra; this
provides a continuous functional calculus, so that we can replace P
in the proof of Lemma 2.5.13 by a continuous function and dispense
with E altogether. This formulation of the Gelfand-Naimark theorem
is the one which is usually seen in the literature.

The Gelfand-Naimark theorem identifies
$\mathcal{A} \cap L^\infty(\tau)$ with a *-subalgebra of
$B(L^2(\tau))$. The closure of this *-subalgebra in the weak
operator topology$^{44}$ is then a von Neumann algebra, which we denote
as $L^\infty(\tau)$. As a consequence, we see that non-commutative
probability spaces are closely related to von Neumann algebras
(equipped with a tracial state $\tau$). However, we refrain from
identifying the former completely with the latter, in order to allow
ourselves the freedom to work with such spaces as $L^{\infty-}$, which
is almost but not quite a von Neumann algebra. Instead, we use the
looser (and more algebraic) definition in Definition 2.5.12.

2.5.2. Limits of non-commutative random variables. One benefit
of working in an abstract setting is that it becomes easier to take
certain types of limits. For instance, it is intuitively obvious that the
cyclic groups $\mathbf{Z}/N\mathbf{Z}$ are "converging" in some sense
to the integer group $\mathbf{Z}$. This convergence can be formalised
by selecting a distinguished generator e of all groups involved
(1 mod N in the case of $\mathbf{Z}/N\mathbf{Z}$, and 1 in the case of
the integers $\mathbf{Z}$), and noting that the set of relations
involving this generator in $\mathbf{Z}/N\mathbf{Z}$ (i.e. the
relations $ne = 0$ when n is divisible by N) converge in a pointwise
sense to the set of relations involving this generator in $\mathbf{Z}$
(i.e. the empty set). Here, to see the convergence, we viewed a group
abstractly via the relations between its generators, rather than on a
concrete realisation of a group as (say) residue classes modulo N.

$^{44}$ The weak operator topology on the space B(H) of bounded operators
on a Hilbert space is the weakest topology for which the coefficient
maps $T \mapsto \langle Tu, v \rangle_H$ are continuous for each
$u, v \in H$.
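The pointwise convergence of relations described above can be made concrete with a few lines of Python. The sketch lists, for several moduli N, which of the non-trivial relations ne = 0 with n in a fixed window hold in the cyclic group of order N; the window size and the function name are illustrative assumptions.

```python
def relations_in_cyclic_group(modulus, window):
    """Set of n in 1..window with n*e = 0 in Z/NZ, where e = 1 mod N."""
    return {n for n in range(1, window + 1) if n % modulus == 0}

# For each fixed n, the relation n*e = 0 fails as soon as N > n, so the
# set of relations seen in any fixed window is eventually empty.
for modulus in (2, 3, 5, 10, 20, 50):
    print(modulus, sorted(relations_in_cyclic_group(modulus, 12)))
```

Any fixed window of candidate relations is eventually empty as N grows, matching the (empty) set of non-trivial relations satisfied by the generator 1 of the integers.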

We can similarly define convergence of random variables in non- 
commutative probability spaces as follows. 

Definition 2.5.15 (Convergence). Let $(\mathcal{A}_n, \tau_n)$ be a
sequence of non-commutative probability spaces, and let
$(\mathcal{A}_\infty, \tau_\infty)$ be an additional non-commutative
probability space. For each n, let $X_{n,1}, \ldots, X_{n,k}$ be a
sequence of random variables in $\mathcal{A}_n$, and let
$X_{\infty,1}, \ldots, X_{\infty,k}$ be a sequence of random variables
in $\mathcal{A}_\infty$. We say that $X_{n,1}, \ldots, X_{n,k}$
converges in the sense of moments to
$X_{\infty,1}, \ldots, X_{\infty,k}$ if we have

$\tau_n(X_{n,i_1} \cdots X_{n,i_m}) \to \tau_\infty(X_{\infty,i_1} \cdots X_{\infty,i_m})$

as $n \to \infty$ for any sequence
$i_1, \ldots, i_m \in \{1, \ldots, k\}$. We say that
$X_{n,1}, \ldots, X_{n,k}$ converge in the sense of *-moments to
$X_{\infty,1}, \ldots, X_{\infty,k}$ if
$X_{n,1}, \ldots, X_{n,k}, X_{n,1}^*, \ldots, X_{n,k}^*$ converges in
the sense of moments to
$X_{\infty,1}, \ldots, X_{\infty,k}, X_{\infty,1}^*, \ldots, X_{\infty,k}^*$.

If $X_1, \ldots, X_k$ (viewed as a constant k-tuple in n) converges in
the sense of moments (resp. *-moments) to $Y_1, \ldots, Y_k$, we say
that $X_1, \ldots, X_k$ and $Y_1, \ldots, Y_k$ have matching joint
moments (resp. matching joint *-moments).

Example 2.5.16. If $X_n, Y_n$ converge in the sense of moments to
$X_\infty, Y_\infty$ then we have for instance that

$\tau_n(X_n Y_n^k X_n) \to \tau_\infty(X_\infty Y_\infty^k X_\infty)$

as $n \to \infty$ for each k, while if they converge in the stronger
sense of *-moments then we obtain more limits, such as

$\tau_n(X_n Y_n^k X_n^*) \to \tau_\infty(X_\infty Y_\infty^k X_\infty^*)$.

Note however that no uniformity in k is assumed for this convergence;
in particular, if k varies in n (e.g. if $k = O(\log n)$), there is now
no guarantee that one still has convergence.

Remark 2.5.17. When the underlying objects $X_{n,1}, \ldots, X_{n,k}$
and $X_{\infty,1}, \ldots, X_{\infty,k}$ are self-adjoint, then there
is no distinction between convergence in moments and convergence in
*-moments. However, for non-self-adjoint variables, the latter type of
convergence is far stronger, and the former type is usually too weak to
be of much use, even in the commutative setting. For instance, let X be
a classical random variable drawn uniformly at random from the unit
circle $\{z \in \mathbf{C} : |z| = 1\}$. Then the constant sequence
$X_n = X$ has all the same moments as the zero random variable 0, and
thus converges in the sense of moments to zero, but does not converge
in the *-moment sense to zero.
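The claim about the unit circle can be verified by a direct computation. Writing the random variable as $X = e^{i\theta}$ with $\theta$ uniform on $[0, 2\pi)$ (a parametrisation introduced here for convenience), one has:

```latex
\begin{align*}
\mathbf{E} X^k &= \frac{1}{2\pi} \int_0^{2\pi} e^{ik\theta} \, d\theta = 0
  = \mathbf{E}\, 0^k \qquad (k = 1, 2, \ldots), \\
\mathbf{E} X X^* &= \mathbf{E} |X|^2 = 1 \;\neq\; 0 = \mathbf{E}\, 0 \cdot 0^* .
\end{align*}
```

The first line shows that all moments of X agree with those of the zero random variable (the zeroth moment being 1 in both cases), while the second line exhibits a *-moment that distinguishes the two.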

It is also clear that if we require that $\mathcal{A}_\infty$ be
generated by $X_{\infty,1}, \ldots, X_{\infty,k}$ in the *-algebraic
sense (i.e. every element of $\mathcal{A}_\infty$ is a polynomial
combination of $X_{\infty,1}, \ldots, X_{\infty,k}$ and their adjoints)
then a limit in the sense of *-moments, if it exists, is unique up to
matching joint *-moments.

For a sequence $X_n$ of a single, uniformly bounded, self-adjoint
element, convergence in moments is equivalent to convergence in
distribution:

Exercise 2.5.11. Let $X_n \in \mathcal{A}_n$ be a sequence of
self-adjoint elements in non-commutative probability spaces
$(\mathcal{A}_n, \tau_n)$ with $\rho(X_n)$ uniformly bounded, and let
$X_\infty \in \mathcal{A}_\infty$ be another bounded self-adjoint
element in a non-commutative probability space
$(\mathcal{A}_\infty, \tau_\infty)$. Show that $X_n$ converges in
moments to $X_\infty$ if and only if the spectral measure $\mu_{X_n}$
converges in the vague topology to $\mu_{X_\infty}$.

Thus, for instance, one can rephrase the Wigner semicircular
law (in the convergence in expectation formulation) as the assertion
that a sequence $M_n \in L^{\infty-} \otimes M_n(\mathbf{C})$ of Wigner random
matrices with (say) subgaussian entries of mean zero and variance one,
when rescaled to $\frac{1}{\sqrt{n}} M_n$ and viewed as elements of the
non-commutative probability space
$(L^{\infty-} \otimes M_n(\mathbf{C}), \mathbf{E}\frac{1}{n}\operatorname{tr})$, will converge to any
bounded self-adjoint element $u$ of a non-commutative probability space
with spectral measure given by the semicircular distribution
$\mu_{sc} := \frac{1}{2\pi}(4 - x^2)_+^{1/2}\, dx$. Such
elements are known as semicircular elements. Here are some easy
examples of semicircular elements:

(i) A classical real random variable $u$ drawn using the probability
measure $\mu_{sc}$.

(ii) The identity function $x \mapsto x$ in the Lebesgue space $L^\infty(d\mu_{sc})$,
endowed with the trace $\tau(f) := \int_{\mathbf{R}} f\, d\mu_{sc}$.

(iii) The function $\theta \mapsto 2\cos\theta$ in the Lebesgue space
$L^\infty([0, \pi], \frac{2}{\pi}\sin^2\theta\, d\theta)$.

Here is a more interesting example of a semicircular element:

Exercise 2.5.12. Let $(\mathcal{A}, \tau)$ be the non-commutative space consisting
of bounded operators $B(\ell^2(\mathbf{N}))$ on the natural numbers with
trace $\tau(X) := \langle e_0, X e_0 \rangle_{\ell^2(\mathbf{N})}$, where $e_0, e_1, \dots$ is the
standard basis of $\ell^2(\mathbf{N})$. Let $U : e_n \mapsto e_{n+1}$ be the right shift on
$\ell^2(\mathbf{N})$. Show that $U + U^*$ is a semicircular operator. (Hint: one way
to proceed here is to use Fourier analysis to identify $\ell^2(\mathbf{N})$ with the
space of odd functions $\theta \mapsto f(\theta)$ on $\mathbf{R}/2\pi\mathbf{Z}$, with $U$ being the
operator that maps $\sin(n\theta)$ to $\sin((n+1)\theta)$; show that $U + U^*$ is then
the operation of multiplication by $2\cos\theta$.) One can also interpret $U$
as a creation operator in a Fock space, but we will not do so here.

Exercise 2.5.13. With the notation of the previous exercise, show
that $\tau((U + U^*)^k)$ is zero for odd $k$, and is equal to the Catalan
number $C_{k/2}$ from Section 2.3 when $k$ is even. Note that this provides
a (very) slightly different proof of the semicircular law from that given
from the moment method in Section 2.4.
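
As a quick sanity check on Exercise 2.5.13 (an illustration, not a proof), one can compute $\tau((U+U^*)^k) = \langle e_0, (U+U^*)^k e_0 \rangle$ exactly by tracking the coefficients of $(U+U^*)^j e_0$ in the standard basis. The sketch below is in Python; the names `shift_moment` and `catalan` are illustrative choices and do not come from the text.

```python
from math import comb


def catalan(m):
    """Catalan number C_m = binom(2m, m) / (m + 1)."""
    return comb(2 * m, m) // (m + 1)


def shift_moment(k):
    """Return <e_0, (U + U*)^k e_0> for the right shift U on l^2(N)."""
    # coeffs[n] is the coefficient of e_n; after k steps only e_0..e_k can occur.
    coeffs = [0] * (k + 2)
    coeffs[0] = 1
    for _ in range(k):
        nxt = [0] * (k + 2)
        for n, c in enumerate(coeffs):
            if c == 0:
                continue
            if n + 1 < len(nxt):
                nxt[n + 1] += c  # U e_n = e_{n+1}
            if n >= 1:
                nxt[n - 1] += c  # U* e_n = e_{n-1}, and U* e_0 = 0
        coeffs = nxt
    return coeffs[0]


moments = [shift_moment(k) for k in range(9)]
# moments == [1, 0, 1, 0, 2, 0, 5, 0, 14]
```

The odd moments vanish because an odd number of unit steps cannot return to $e_0$, and the even moments count the paths that stay at non-negative indices, which is one of the standard descriptions of the Catalan numbers.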

Because we are working in such an abstract setting with so few 
axioms, limits exist in abundance: 

Exercise 2.5.14. For each $n$, let $X_{n,1}, \dots, X_{n,k}$ be bounded
self-adjoint elements of a tracial non-commutative space $(\mathcal{A}_n, \tau_n)$.
Suppose that the spectral radii $\rho(X_{n,1}), \dots, \rho(X_{n,k})$ are uniformly
bounded in $n$. Show that there exists a subsequence $n_j$ and bounded
self-adjoint elements $X_1, \dots, X_k$ of a tracial non-commutative space
$(\mathcal{A}, \tau)$ such that $X_{n_j,1}, \dots, X_{n_j,k}$ converge in moments to
$X_1, \dots, X_k$ as $j \to \infty$. (Hint: use the Bolzano-Weierstrass theorem
and the Arzelá-Ascoli diagonalisation trick to obtain a subsequence in
which each of the joint moments of $X_{n_j,1}, \dots, X_{n_j,k}$ converge as
$j \to \infty$. Use these moments to build a noncommutative probability space.)

2.5.3. Free independence. We now come to the fundamental con- 
cept in free probability theory, namely that of free independence. 

Definition 2.5.18 (Free independence). A collection $X_1, \dots, X_k$ of
random variables in a non-commutative probability space $(\mathcal{A}, \tau)$ is
freely independent (or free for short) if one has

$$\tau\big( (P_1(X_{i_1}) - \tau(P_1(X_{i_1}))) \cdots (P_m(X_{i_m}) - \tau(P_m(X_{i_m}))) \big) = 0$$

whenever $P_1, \dots, P_m$ are polynomials and $i_1, \dots, i_m \in \{1, \dots, k\}$ are
indices with no two adjacent $i_j$ equal.

A sequence $X_{n,1}, \dots, X_{n,k}$ of random variables in a non-commutative
probability space $(\mathcal{A}_n, \tau_n)$ is asymptotically freely independent (or
asymptotically free for short) if one has

$$\tau_n\big( (P_1(X_{n,i_1}) - \tau_n(P_1(X_{n,i_1}))) \cdots (P_m(X_{n,i_m}) - \tau_n(P_m(X_{n,i_m}))) \big) \to 0$$

as $n \to \infty$ whenever $P_1, \dots, P_m$ are polynomials and
$i_1, \dots, i_m \in \{1, \dots, k\}$ are indices with no two adjacent $i_j$ equal.

Remark 2.5.19. The above definition describes freeness of collections
of random variables in $\mathcal{A}$. One can more generally define freeness of
collections of subalgebras of $\mathcal{A}$, which in some sense is the more natural
concept from a category-theoretic perspective, but we will not need
this concept here. See e.g. [Bi2003] for more discussion.

Thus, for instance, if $X, Y$ are freely independent, then
$\tau(P(X)Q(Y)R(X)S(Y))$ will vanish for any polynomials $P, Q, R, S$ for
which $\tau(P(X)), \tau(Q(Y)), \tau(R(X)), \tau(S(Y))$ all vanish. This is in
contrast to classical independence of classical (commutative) random
variables, which would only assert that $\tau(P(X)Q(Y)) = 0$ whenever
$\tau(P(X)), \tau(Q(Y))$ both vanish.

To contrast free independence with classical independence, suppose
that $\tau(X) = \tau(Y) = 0$. If $X, Y$ were freely independent, then
$\tau(XYXY) = 0$. If instead $X, Y$ were commuting and classically
independent, then we would instead have
$\tau(XYXY) = \tau(X^2 Y^2) = \tau(X^2)\tau(Y^2)$, which would almost certainly
be non-zero.

For a trivial example of free independence, $X$ and $Y$ automatically
are freely independent if at least one of $X, Y$ is constant (i.e.
a multiple of the identity 1). In the commutative setting, this is
basically the only way one can have free independence:

Exercise 2.5.15. Suppose that $X, Y$ are freely independent elements
of a faithful non-commutative probability space which also commute.
Show that at least one of $X, Y$ is equal to a scalar. (Hint: First
normalise $X, Y$ to have trace zero, and consider $\tau(XYXY)$.)

A less trivial example of free independence comes from the free 
group, which provides a clue as to the original motivation of this 
concept: 

Exercise 2.5.16. Let $F_2$ be the free group on two generators $g_1, g_2$.
Let $\mathcal{A} = B(\ell^2(F_2))$ be the non-commutative probability space of
bounded linear operators on the Hilbert space $\ell^2(F_2)$, with trace
$\tau(X) := \langle X e_0, e_0 \rangle$, where $e_0$ is the Kronecker delta function at
the identity. Let $U_1, U_2 \in \mathcal{A}$ be the shift operators

$$U_1 f(g) := f(g_1 g); \quad U_2 f(g) := f(g_2 g)$$

for $f \in \ell^2(F_2)$ and $g \in F_2$. Show that $U_1, U_2$ are freely independent.

For classically independent commuting random variables $X, Y$,
knowledge of the individual moments $\tau(X^k)$, $\tau(Y^k)$ gave complete
information on the joint moments: $\tau(X^k Y^l) = \tau(X^k)\tau(Y^l)$. The
same fact is true for freely independent random variables, though the
situation is more complicated. We begin with a simple case: computing
$\tau(XY)$ in terms of the moments of $X, Y$. From free independence
we have

$$\tau((X - \tau(X))(Y - \tau(Y))) = 0.$$

Expanding this using the linearity of the trace, one soon sees that

(2.123) $\tau(XY) = \tau(X)\tau(Y).$

So far, this is just as with the classically independent case. Next, we
consider a slightly more complicated moment, $\tau(XYX)$. If we split
$Y = \tau(Y) + (Y - \tau(Y))$, we can write this as

$$\tau(XYX) = \tau(Y)\tau(X^2) + \tau(X(Y - \tau(Y))X).$$

In the classically independent case, we can conclude the latter term
would vanish. We cannot immediately say that in the freely independent
case, because only one of the factors has mean zero. But
from (2.123) we know that $\tau(X(Y - \tau(Y))) = \tau((Y - \tau(Y))X) = 0$.
Because of this, we can expand

$$\tau(X(Y - \tau(Y))X) = \tau((X - \tau(X))(Y - \tau(Y))(X - \tau(X)))$$

and now free independence does ensure that this term vanishes, and
so

(2.124) $\tau(XYX) = \tau(Y)\tau(X^2).$

So again we have not yet deviated from the classically independent
case. But now let us look at $\tau(XYXY)$. We split the second $X$ into
$\tau(X)$ and $X - \tau(X)$. Using (2.123) to control the former term, we
have

$$\tau(XYXY) = \tau(X)^2\tau(Y^2) + \tau(XY(X - \tau(X))Y).$$

From (2.124) we have $\tau(Y(X - \tau(X))Y) = 0$, so we have

$$\tau(XYXY) = \tau(X)^2\tau(Y^2) + \tau((X - \tau(X))Y(X - \tau(X))Y).$$

Now we split $Y$ into $\tau(Y)$ and $Y - \tau(Y)$. Free independence eliminates
all terms except

$$\tau(XYXY) = \tau(X)^2\tau(Y^2) + \tau((X - \tau(X))\tau(Y)(X - \tau(X))\tau(Y))$$

which simplifies to

$$\tau(XYXY) = \tau(X)^2\tau(Y^2) + \tau(X^2)\tau(Y)^2 - \tau(X)^2\tau(Y)^2$$

which differs from the classical independence prediction of $\tau(X^2)\tau(Y^2)$.
This process can be continued:

Exercise 2.5.17. Let $X_1, \dots, X_k$ be freely independent. Show that
any joint moment of $X_1, \dots, X_k$ can be expressed as a polynomial
combination of the individual moments $\tau(X_i^j)$ of the $X_i$. (Hint:
induct on the complexity of the moment.)
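
The formula for $\tau(XYXY)$ derived above can be checked on a concrete free pair. The following Python sketch takes for granted the conclusion of Exercise 2.5.16 (that the two generators of the free group are freely independent) and works in the group algebra of $F_2$, where the trace of an element is its coefficient at the identity. The choices $X = a + U_1 + U_1^{-1}$, $Y = b + U_2 + U_2^{-1}$ with $a = 3$, $b = 5$, and all function names, are illustrative assumptions rather than anything taken from the text.

```python
from collections import defaultdict


def reduce_word(word):
    """Freely reduce a word; letters are nonzero ints, and -g is the inverse of g."""
    out = []
    for letter in word:
        if out and out[-1] == -letter:
            out.pop()          # cancel g g^{-1}
        else:
            out.append(letter)
    return tuple(out)


def multiply(a, b):
    """Product in the group algebra; elements are dicts {reduced word: coefficient}."""
    prod = defaultdict(int)
    for w1, c1 in a.items():
        for w2, c2 in b.items():
            prod[reduce_word(w1 + w2)] += c1 * c2
    return dict(prod)


def moment(*factors):
    """Trace of a product: the coefficient of the identity (the empty word)."""
    result = {(): 1}
    for f in factors:
        result = multiply(result, f)
    return result.get((), 0)


a, b = 3, 5                                # hypothetical constants
X = {(): a, (1,): 1, (-1,): 1}             # a + U_1 + U_1^{-1}
Y = {(): b, (2,): 1, (-2,): 1}             # b + U_2 + U_2^{-1}

lhs = moment(X, Y, X, Y)
free_rhs = (moment(X) ** 2 * moment(Y, Y)
            + moment(X, X) * moment(Y) ** 2
            - moment(X) ** 2 * moment(Y) ** 2)
classical = moment(X, X) * moment(Y, Y)
# lhs == free_rhs == 293, whereas classical == 297
```

Here $X$ is a polynomial in $U_1, U_1^{-1}$ and $Y$ is a polynomial in $U_2, U_2^{-1}$, so the check is really using freeness of the two subalgebras generated by each generator and its inverse.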

The product measure construction allows us to generate classi- 
cally independent random variables at will (after extending the un- 
derlying sample space): see Exercise 1.1.20. There is an analogous 
construction, called the amalgamated free product, that allows one 
to generate families of freely independent random variables, each of
which has a specified distribution. Let us give an illustrative special
case of this construction: 

Lemma 2.5.20 (Free products). For each $1 \leq i \leq k$, let $(\mathcal{A}_i, \tau_i)$
be a non-commutative probability space. Then there exists a
non-commutative probability space $(\mathcal{A}, \tau)$ which contains embedded copies
of each of the $(\mathcal{A}_i, \tau_i)$, such that whenever $X_i \in \mathcal{A}_i$ for $i = 1, \dots, k$,
then $X_1, \dots, X_k$ are freely independent.

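Proof. (Sketch) Recall that each $\mathcal{A}_i$ can be given an inner product
$\langle , \rangle_{L^2(\mathcal{A}_i)}$. One can then orthogonally decompose each space $\mathcal{A}_i$ into
the constants $\mathbf{C}$, plus the trace zero elements
$\mathcal{A}_i^0 := \{X \in \mathcal{A}_i : \tau(X) = 0\}$.

We now form the Fock space $\mathcal{F}$ to be the inner product space
formed by the direct sum of tensor products

(2.125) $\mathcal{A}_{i_1}^0 \otimes \dots \otimes \mathcal{A}_{i_m}^0$

where $m \geq 0$, and $i_1, \dots, i_m \in \{1, \dots, k\}$ are such that no adjacent
pair $i_j, i_{j+1}$ of the $i_1, \dots, i_m$ are equal. Each element $X_i \in \mathcal{A}_i$ then
acts on this Fock space by defining

$$X_i(Y_{i_1} \otimes \dots \otimes Y_{i_m}) := X_i \otimes Y_{i_1} \otimes \dots \otimes Y_{i_m}$$

when $i \neq i_1$, and

$$X_i(Y_{i_1} \otimes \dots \otimes Y_{i_m}) := \tau(X_i Y_{i_1})\, Y_{i_2} \otimes \dots \otimes Y_{i_m} + (X_i Y_{i_1} - \tau(X_i Y_{i_1})) \otimes Y_{i_2} \otimes \dots \otimes Y_{i_m}$$

when $i = i_1$. One can thus map $\mathcal{A}_i$ into the space
$\mathcal{A} := \operatorname{Hom}(\mathcal{F}, \mathcal{F})$ of linear maps from $\mathcal{F}$ to itself. The latter
can be given the structure of a non-commutative space by defining the
trace $\tau(X)$ of an element $X \in \mathcal{A}$ by the formula
$\tau(X) := \langle X e_\emptyset, e_\emptyset \rangle_{\mathcal{F}}$, where $e_\emptyset$ is the vacuum
state of $\mathcal{F}$, being the unit of the $m = 0$ tensor product. One can verify
(Exercise!) that $\mathcal{A}_i$ embeds into $\mathcal{A}$ and that elements from different
$\mathcal{A}_i$ are freely independent. □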

Exercise 2.5.18. Complete the proof of Lemma 2.5.20. (Hint: you
may find it helpful to first do Exercise 2.5.16, as the construction here
is an abstraction of the one in that exercise.)

Finally, we illustrate the fundamental connection between free
probability and random matrices first observed by Voiculescu [Vo1991],
namely that (classically) independent families of random matrices are
asymptotically free. The intuition here is that while a large random
matrix $M$ will certainly correlate with itself (so that, for instance,
$\operatorname{tr} M^* M$ will be large), once one interposes an independent random
matrix $N$ of trace zero, the correlation is largely destroyed (thus, for
instance, $\operatorname{tr} M^* N M$ will usually be quite small).

We give a typical instance of this phenomenon here: 

Proposition 2.5.21 (Asymptotic freeness of Wigner matrices). Let
$M_{n,1}, \dots, M_{n,k}$ be a collection of independent $n \times n$ Wigner matrices,
where the coefficients all have uniformly bounded $m$-th moments for
each $m$. Then the random variables
$\frac{1}{\sqrt{n}} M_{n,1}, \dots, \frac{1}{\sqrt{n}} M_{n,k} \in (L^{\infty-} \otimes M_n(\mathbf{C}), \mathbf{E}\frac{1}{n}\operatorname{tr})$
are asymptotically free.

Proof. (Sketch) Let us abbreviate $\frac{1}{\sqrt{n}} M_{n,j}$ as $X_j$ (suppressing
the $n$ dependence). It suffices to show that the traces

$$\tau\Big( \prod_{j=1}^m \big( X_{i_j}^{a_j} - \tau(X_{i_j}^{a_j}) \big) \Big) = o(1)$$

for each fixed choice of natural numbers $a_1, \dots, a_m$, where no two
adjacent $i_j$ are equal.

Recall from Section 2.3 that $\tau(X_j^{a_j})$ is (up to errors of $o(1)$) equal
to a normalised count of paths of length $a_j$ in which each edge is
traversed exactly twice, with the edges forming a tree. After normalisation,
this count is equal to 0 when $a_j$ is odd, and equal to the Catalan
number $C_{a_j/2}$ when $a_j$ is even.

One can perform a similar computation to compute
$\tau(\prod_{j=1}^m X_{i_j}^{a_j})$. Up to errors of $o(1)$, this is a normalised count
of coloured paths of length $a_1 + \dots + a_m$, where the first $a_1$ edges
are coloured with colour $i_1$, the next $a_2$ with colour $i_2$, etc.
Furthermore, each edge is traversed exactly twice (with the two traversals
of each edge being assigned the same colour), and the edges form a tree.
As a consequence, there must exist a $j$ for which the block of $a_j$ edges
of colour $i_j$ form their own sub-tree, which contributes a factor of
$C_{a_j/2}$ or 0 to the final trace. Because of this, when one instead
computes the normalised expression
$\tau(\prod_{j=1}^m (X_{i_j}^{a_j} - \tau(X_{i_j}^{a_j})))$, all contributions that are
not $o(1)$ cancel themselves out, and the claim follows. □
Exercise 2.5.19. Expand the above sketch into a full proof of the
above proposition.

Remark 2.5.22. This is by no means the only way in which random
matrices can become asymptotically free. For instance, if instead one
considers random matrices of the form $M_{n,i} = U_i^* A_i U_i$, where $A_i$ are
deterministic Hermitian matrices with uniformly bounded eigenvalues,
and the $U_i$ are iid unitary matrices drawn using Haar measure
on the unitary group $U(n)$, one can also show that the $M_{n,i}$ are
asymptotically free; again, see [Vo1991] for details.

2.5.4. Free convolution. When one is summing two classically
independent (real-valued) random variables $X$ and $Y$, the distribution
$\mu_{X+Y}$ of the sum $X + Y$ is the convolution $\mu_X * \mu_Y$ of the
distributions $\mu_X$ and $\mu_Y$. This convolution can be computed by means of the
characteristic function

$$F_X(t) := \tau(e^{itX}) = \int_{\mathbf{R}} e^{itx}\, d\mu_X(x)$$

by means of the simple formula

$$\tau(e^{it(X+Y)}) = \tau(e^{itX})\tau(e^{itY}).$$

As we saw in Section 2.2, this can be used in particular to establish
a short proof of the central limit theorem.

There is an analogous theory when summing two freely independent
(self-adjoint) non-commutative random variables $X$ and $Y$; the
distribution $\mu_{X+Y}$ turns out to be a certain combination $\mu_X \boxplus \mu_Y$,
known as the free convolution of $\mu_X$ and $\mu_Y$. To compute this free
convolution, one does not use the characteristic function; instead, the
correct tool is the Stieltjes transform

$$s_X(z) := \tau((X - z)^{-1}) = \int_{\mathbf{R}} \frac{1}{x - z}\, d\mu_X(x)$$

which has already been discussed earlier.

Here's how to use this transform to compute free convolutions. If
one wishes, one can assume that $X$ is bounded so that all series involved
converge for $z$ large enough, though actually the entire argument here can
be performed at a purely algebraic level, using formal power series,
and so the boundedness hypothesis here is not actually necessary.

The trick (which we already saw in Section 2.4) is not to view
$s = s_X(z)$ as a function of $z$, but rather to view $z = z_X(s)$ as a
function of $s$. Given that one asymptotically has $s \sim -1/z$ for large $z$,
we expect to be able to perform this inversion for $z$ large and $s$ close to
zero; and in any event one can easily invert (2.119) on the level of
formal power series.

With this inversion, we thus have

(2.126) $s = \tau((X - z_X(s))^{-1})$

and thus

$$(X - z_X(s))^{-1} = s(1 - E_X)$$

for some $E_X = E_X(s)$ of trace zero. Now we do some (formal) algebraic
sleight of hand. We rearrange the above identity as

$$X = z_X(s) + s^{-1}(1 - E_X)^{-1}.$$

Similarly we have

$$Y = z_Y(s) + s^{-1}(1 - E_Y)^{-1}$$

and so

$$X + Y = z_X(s) + z_Y(s) + s^{-1}[(1 - E_X)^{-1} + (1 - E_Y)^{-1}].$$

We can combine the second two terms via the identity

$$(1 - E_X)^{-1} + (1 - E_Y)^{-1} = (1 - E_X)^{-1}(1 - E_Y + 1 - E_X)(1 - E_Y)^{-1}.$$

Meanwhile

$$1 = (1 - E_X)^{-1}(1 - E_Y - E_X + E_X E_Y)(1 - E_Y)^{-1}$$

and so

$$X + Y = z_X(s) + z_Y(s) + s^{-1} + s^{-1}[(1 - E_X)^{-1}(1 - E_X E_Y)(1 - E_Y)^{-1}].$$

We can rearrange this a little bit as

$$(X + Y - z_X(s) - z_Y(s) - s^{-1})^{-1} = s[(1 - E_Y)(1 - E_X E_Y)^{-1}(1 - E_X)].$$

We expand out as (formal) Neumann series:

$$(1 - E_Y)(1 - E_X E_Y)^{-1}(1 - E_X) = (1 - E_Y)(1 + E_X E_Y + E_X E_Y E_X E_Y + \dots)(1 - E_X).$$

This expands out to equal 1 plus a whole string of alternating products
of $E_X$ and $E_Y$.

Now we use the hypothesis that $X$ and $Y$ are free. This easily
implies that $E_X$ and $E_Y$ are also free. But they also have trace zero,
thus by the definition of free independence, all alternating products
of $E_X$ and $E_Y$ have zero trace⁴⁵. We conclude that

$$\tau((1 - E_Y)(1 - E_X E_Y)^{-1}(1 - E_X)) = 1$$

and so

$$\tau((X + Y - z_X(s) - z_Y(s) - s^{-1})^{-1}) = s.$$

Comparing this against (2.126) for $X + Y$ we conclude that

$$z_{X+Y}(s) = z_X(s) + z_Y(s) + s^{-1}.$$

Thus, if we define the R-transform $R_X$ of $X$ to be (formally) given
by the formula

$$R_X(s) := z_X(-s) - s^{-1}$$

then we have the addition formula

$$R_{X+Y} = R_X + R_Y.$$

Since one can recover the Stieltjes transform $s_X$ (and hence the
R-transform $R_X$) from the spectral measure $\mu_X$ and vice versa, this
formula (in principle, at least) lets one compute the spectral measure
$\mu_{X+Y}$ of $X + Y$ from the spectral measures $\mu_X, \mu_Y$, thus allowing one
to define free convolution.

For comparison, we have the (formal) addition formula

$$\log F_{X+Y} = \log F_X + \log F_Y$$

for classically independent real random variables $X, Y$. The following
exercises carry this analogy a bit further.

Exercise 2.5.20. Let $X$ be a classical real random variable. Working
formally, show that

$$\log F_X(t) = \sum_{k=1}^\infty \frac{\kappa_k(X)}{k!} (it)^k$$

where the cumulants $\kappa_k(X)$ can be reconstructed from the moments
$\tau(X^k)$ by the recursive formula

$$\tau(X^k) = \kappa_k(X) + \sum_{j=1}^{k-1} \kappa_j(X) \sum_{a_1 + \dots + a_j = k-j} \tau(X^{a_1 + \dots + a_j})$$

for $k \geq 1$, where the $a_i$ range over non-negative integers. (Hint: start
with the identity $\frac{d}{dt} F_X(t) = (\frac{d}{dt} \log F_X(t)) F_X(t)$.)
Thus for instance $\kappa_1(X) = \tau(X)$ is the expectation,
$\kappa_2(X) = \tau(X^2) - \tau(X)^2$ is the variance, and the third cumulant is
given by the formula

$$\kappa_3(X) = \tau(X^3) - 3\tau(X^2)\tau(X) + 2\tau(X)^3.$$

Establish the additional formula

$$\tau(X^k) = \sum_\pi \prod_{A \in \pi} \kappa_{|A|}(X)$$

where $\pi$ ranges over all partitions of $\{1, \dots, k\}$ into non-empty cells
$A$.

⁴⁵ In the case when there are an odd number of terms in the product, one can
obtain this zero trace property using the cyclic property of trace and induction.

Exercise 2.5.21. Let $X$ be a non-commutative random variable.
Working formally, show that

$$R_X(s) = \sum_{k=1}^\infty C_k(X) s^{k-1}$$

where the free cumulants $C_k(X)$ can be reconstructed from the moments
$\tau(X^k)$ by the recursive formula

$$\tau(X^k) = C_k(X) + \sum_{j=1}^{k-1} C_j(X) \sum_{a_1 + \dots + a_j = k-j} \tau(X^{a_1}) \cdots \tau(X^{a_j})$$

for $k \geq 1$, where the $a_i$ range over non-negative integers. (Hint: start
with the identity $s_X(z) R_X(-s_X(z)) = 1 + z s_X(z)$.) Thus for instance
$C_1(X) = \tau(X)$ is the expectation, $C_2(X) = \tau(X^2) - \tau(X)^2$ is the
variance, and the third free cumulant is given by the formula

$$C_3(X) = \tau(X^3) - 3\tau(X^2)\tau(X) + 2\tau(X)^3.$$

Establish the additional formula

$$\tau(X^k) = \sum_\pi \prod_{A \in \pi} C_{|A|}(X)$$

where $\pi$ ranges over all partitions of $\{1, \dots, k\}$ into non-empty cells
$A$ which are non-crossing, which means that if $a < b < c < d$ lie in
$\{1, \dots, k\}$, then it cannot be the case that $a, c$ lie in one cell $A$ while
$b, d$ lie in a distinct cell $A'$.
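
The two recursions in Exercises 2.5.20 and 2.5.21 are easy to run side by side. The Python sketch below is an illustration under stated assumptions: the helper names are invented here, the classical recursion is used in the equivalent form $\tau(X^k) = \sum_{j=1}^{k} \binom{k-1}{j-1} \kappa_j(X) \tau(X^{k-j})$ (the inner sum in Exercise 2.5.20 has $\binom{k-1}{j-1}$ identical terms), and the free recursion uses the fact that its inner sum is the coefficient of $z^{k-j}$ in $(\sum_a \tau(X^a) z^a)^j$.

```python
from math import comb


def truncated_product(p, q, deg):
    """Multiply two polynomials (coefficient lists), discarding terms above degree deg."""
    out = [0] * (deg + 1)
    for i, a in enumerate(p):
        if a == 0:
            continue
        for j, b in enumerate(q):
            if i + j > deg:
                break
            out[i + j] += a * b
    return out


def classical_cumulants(moments):
    """moments[k-1] = tau(X^k); returns [kappa_1, ..., kappa_K]."""
    K = len(moments)
    m = [1] + list(moments)
    kappa = [0] * (K + 1)
    for k in range(1, K + 1):
        lower = sum(comb(k - 1, j - 1) * kappa[j] * m[k - j] for j in range(1, k))
        kappa[k] = m[k] - lower
    return kappa[1:]


def free_cumulants(moments):
    """moments[k-1] = tau(X^k); returns [C_1, ..., C_K]."""
    K = len(moments)
    m = [1] + list(moments)
    C = [0] * (K + 1)
    for k in range(1, K + 1):
        power, lower = [1] + [0] * K, 0
        for j in range(1, k):
            power = truncated_product(power, m, K)  # moment series to the j-th power
            lower += C[j] * power[k - j]
        C[k] = m[k] - lower
    return C[1:]


gaussian = [0, 1, 0, 3, 0, 15]      # moments of a standard real Gaussian
semicircle = [0, 1, 0, 2, 0, 5]     # moments of a semicircular element
bell = [1, 2, 5, 15, 52]            # Bell numbers (all set partitions)
catalan = [1, 2, 5, 14, 42]         # Catalan numbers (non-crossing partitions)
```

With these inputs, `classical_cumulants(gaussian)` and `free_cumulants(semicircle)` both return `[0, 1, 0, 0, 0, 0]`, while `classical_cumulants(bell)` and `free_cumulants(catalan)` both return `[1, 1, 1, 1, 1]`. The second pair reflects the two partition formulas: setting every cumulant equal to 1 counts all partitions in the classical case and only the non-crossing ones in the free case.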

Remark 2.5.23. These computations illustrate a more general prin- 
ciple in free probability, in that the combinatorics of free probability 
tend to be the "non-crossing" analogue of the combinatorics of clas- 
sical probability; compare with Remark 2.3.18. 

Remark 2.5.24. The R-transform allows for efficient computation
of the spectral behaviour of sums $X + Y$ of free random variables.
There is an analogous transform, the S-transform, for computing the
spectral behaviour (or more precisely, the joint moments) of products
$XY$ of free random variables; see for instance [Sp].

The R-transform clarifies the privileged role of the semicircular
elements:

Exercise 2.5.22. Let $u$ be a semicircular element. Show that
$R_{\sqrt{t} u}(s) = ts$ for any $t > 0$. In particular, the free convolution of
$\sqrt{t} u$ and $\sqrt{t'} u$ is $\sqrt{t + t'} u$.

Exercise 2.5.23. From the above exercise, we see that the effect of
adding a free copy of $\sqrt{t} u$ to a non-commutative random variable $X$
is to shift the R-transform by $ts$. Explain how this is compatible with
the Dyson Brownian motion computations in Section 2.4.

It also gives a free analogue of the central limit theorem: 

Exercise 2.5.24 (Free central limit theorem). Let $X$ be a self-adjoint
random variable with mean zero and variance one (i.e. $\tau(X) = 0$ and
$\tau(X^2) = 1$), and let $X_1, X_2, X_3, \dots$ be free copies of $X$. Let
$S_n := (X_1 + \dots + X_n)/\sqrt{n}$. Show that the coefficients of the formal
power series $R_{S_n}(s)$ converge to that of the identity function $s$.
Conclude that $S_n$ converges in the sense of moments to a semicircular
element $u$.
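
The convergence in Exercise 2.5.24 can be watched numerically at the level of free cumulants. The Python sketch below rests on two facts that should be read as assumptions of the illustration: free cumulants add over free sums (the addition formula $R_{X+Y} = R_X + R_Y$) and scale as $C_k(cX) = c^k C_k(X)$, so that $C_k(S_n) = n^{1-k/2} C_k(X)$. To keep the arithmetic exact it takes $n = 4^p$, so that $\sqrt{n} = 2^p$ is rational; the function names and the choice of a symmetric $\pm 1$ variable for $X$ are illustrative.

```python
from fractions import Fraction


def truncated_product(p, q, deg):
    """Multiply two polynomials (coefficient lists), discarding terms above degree deg."""
    out = [0] * (deg + 1)
    for i, a in enumerate(p):
        if a == 0:
            continue
        for j, b in enumerate(q):
            if i + j > deg:
                break
            out[i + j] += a * b
    return out


def free_cumulants(moments):
    """moments[k-1] = tau(X^k); returns [C_1, ..., C_K] via the recursion of Exercise 2.5.21."""
    K = len(moments)
    m = [1] + list(moments)
    C = [0] * (K + 1)
    for k in range(1, K + 1):
        power, lower = [1] + [0] * K, 0
        for j in range(1, k):
            power = truncated_product(power, m, K)
            lower += C[j] * power[k - j]
        C[k] = m[k] - lower
    return C[1:]


def moments_from_free_cumulants(cumulants):
    """Inverse of free_cumulants: rebuild tau(X^k) from C_1, ..., C_K."""
    K = len(cumulants)
    C = [0] + list(cumulants)
    m = [1] + [0] * K                 # entries m[k..K] are filled in as we go
    for k in range(1, K + 1):
        power, total = [1] + [0] * K, C[k]
        for j in range(1, k):
            power = truncated_product(power, m, K)
            total += C[j] * power[k - j]   # only uses m[a] with a <= k - j < k
        m[k] = total
    return m[1:]


def normalised_sum_moments(moments, p):
    """Moments of S_n = (X_1 + ... + X_n) / sqrt(n) for free copies X_i, with n = 4**p."""
    n, root = 4 ** p, 2 ** p
    scaled = [Fraction(n) * c / Fraction(root) ** k
              for k, c in enumerate(free_cumulants(moments), start=1)]
    return moments_from_free_cumulants(scaled)


bernoulli = [0, 1, 0, 1, 0, 1]        # moments of a symmetric +-1 variable
limit_moments = normalised_sum_moments(bernoulli, 10)
```

For this choice of $X$ the fourth moment of $S_n$ comes out as exactly $2 - 1/n$, and the first six moments approach $0, 1, 0, 2, 0, 5$, the moments of a semicircular element.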

The free central limit theorem implies the Wigner semicircular
law, at least for the GUE ensemble and in the sense of expectation.
Indeed, if $M_n$ is an $n \times n$ GUE matrix, then the matrices
$\frac{1}{\sqrt{n}} M_n$ are a.s. uniformly bounded (by the Bai-Yin theorem,
Notes 3), and so (after passing to a subsequence, if necessary), they
converge in the sense of moments to some limit $u$.

On the other hand, if $M'_n$ is an independent copy of $M_n$, then
$M_n + M'_n \equiv \sqrt{2} M_n$ (equality in distribution) from the properties
of Gaussians. Taking limits, we conclude that $u + u' \equiv \sqrt{2} u$, where
(by Proposition 2.5.21) $u'$ is a free copy of $u$. Comparing this with the
free central limit theorem (or just the additivity property of
R-transforms) we see that $u$ must have the semicircular distribution.
Thus the semicircular distribution is the only possible limit point of the
$\frac{1}{\sqrt{n}} M_n$, and the Wigner semicircular law then holds (in
expectation, and for GUE). Using concentration of measure, we can upgrade
the convergence in expectation to a.s. convergence; using the Lindeberg
replacement trick one can replace GUE with arbitrary Wigner matrices with
(say) bounded coefficients; and then by using the truncation trick one can
remove the boundedness hypothesis. (These latter few steps were also
discussed in Section 2.4.)

2.6. Gaussian ensembles 

Our study of random matrices, to date, has focused on somewhat 
general ensembles, such as iid random matrices or Wigner random 
matrices, in which the distribution of the individual entries of the 
matrices was essentially arbitrary (as long as certain moments, such 
as the mean and variance, were normalised). In this section, we now 
focus on two much more special, and much more symmetric, ensem- 
bles: 

(i) The Gaussian Unitary Ensemble (GUE), which is an ensemble
of random $n \times n$ Hermitian matrices $M_n$ in which the
upper-triangular entries are iid with distribution $N(0, 1)_{\mathbf{C}}$,
and the diagonal entries are iid with distribution $N(0, 1)_{\mathbf{R}}$,
and independent of the upper-triangular ones; and

(ii) The Gaussian random matrix ensemble, which is an ensemble
of random $n \times n$ (non-Hermitian) matrices $M_n$ whose
entries are iid with distribution $N(0, 1)_{\mathbf{C}}$.

The symmetric nature of these ensembles will allow us to compute
the spectral distribution by exact algebraic means, revealing a
surprising connection with orthogonal polynomials and with
determinantal processes. This will, for instance, recover the semicircular law
for GUE, but will also reveal fine spacing information, such as the 
distribution of the gap between adjacent eigenvalues, which is largely 
out of reach of tools such as the Stieltjes transform method and the 
moment method (although the moment method, with some effort, is 
able to control the extreme edges of the spectrum). 

Similarly, we will see for the first time the circular law for eigen- 
values of non-Hermitian matrices. 

There are a number of other highly symmetric ensembles which 
can also be treated by the same methods, most notably the Gaussian 
Orthogonal Ensemble (GOE) and the Gaussian Symplectic Ensem- 
ble (GSE). However, for simplicity we shall focus just on the above 
two ensembles. For a systematic treatment of these ensembles, see
[De1999].

2.6.1. The spectrum of GUE. We have already shown using Dyson
Brownian motion in Section 3.1 that we have the Ginibre formula [Gi1965]

(2.127) $\rho_n(\lambda) = \frac{1}{(2\pi)^{n/2}} e^{-|\lambda|^2/2} |\Delta_n(\lambda)|^2$

for the density function of the eigenvalues $(\lambda_1, \dots, \lambda_n) \in \mathbf{R}^n$ of a
GUE matrix $M_n$, where

$$\Delta_n(\lambda) = \prod_{1 \leq i < j \leq n} (\lambda_i - \lambda_j)$$

is the Vandermonde determinant. We now give an alternate proof
of this result (omitting the exact value of the normalising constant
$\frac{1}{(2\pi)^{n/2}}$) that exploits unitary invariance and the change of variables
formula (the latter of which we shall do from first principles). The
one thing to be careful about is that one has to somehow quotient
out by the invariances of the problem before being able to apply the
change of variables formula.

One approach here would be to artificially "fix a gauge" and work 
on some slice of the parameter space which is "transverse" to all the 
symmetries. With such an approach, one can use the classical change 
of variables formula. While this can certainly be done, we shall adopt 
a more "gauge-invariant" approach and carry the various invariances
with us throughout the computation⁴⁶.

We turn to the details. Let $V_n$ be the space of Hermitian $n \times n$
matrices, then the distribution $\mu_{M_n}$ of a GUE matrix $M_n$ is an
absolutely continuous probability measure on $V_n$, which can be written
using the definition of GUE as

$$\mu_{M_n} = C_n \Big( \prod_{1 \leq i < j \leq n} e^{-|\xi_{ij}|^2} \Big) \Big( \prod_{1 \leq i \leq n} e^{-|\xi_{ii}|^2/2} \Big)\, dM_n$$

where $dM_n$ is Lebesgue measure on $V_n$, $\xi_{ij}$ are the coordinates of $M_n$,
and $C_n$ is a normalisation constant (the exact value of which depends
on how one normalises Lebesgue measure on $V_n$). We can express this
more compactly as

$$\mu_{M_n} = C_n e^{-\operatorname{tr}(M_n^2)/2}\, dM_n.$$

Expressed this way, it is clear that the GUE ensemble is invariant
under conjugations $M_n \mapsto U M_n U^{-1}$ by any unitary matrix.

Let $D$ be the diagonal matrix whose entries $\lambda_1 \geq \dots \geq \lambda_n$ are the
eigenvalues of $M_n$ in descending order. Then we have $M_n = UDU^{-1}$
for some unitary matrix $U \in U(n)$. The matrix $U$ is not uniquely
determined; if $R$ is a diagonal unitary matrix, then $R$ commutes with
$D$, and so one can freely replace $U$ with $UR$. On the other hand, if the
eigenvalues of $M_n$ are simple, then the diagonal matrices are the only
matrices that commute with $D$, and so this freedom to right-multiply
$U$ by diagonal unitaries is the only failure of uniqueness here. And in
any case, from the unitary invariance of GUE, we see that even after
conditioning on $D$, we may assume without loss of generality that $U$
is drawn from the invariant Haar measure on $U(n)$. In particular, $U$
and $D$ can be taken to be independent.

Fix a diagonal matrix $D_0 = \operatorname{diag}(\lambda_1^0, \dots, \lambda_n^0)$ for some
$\lambda_1^0 > \dots > \lambda_n^0$, let $\varepsilon > 0$ be extremely small, and let us
compute the probability

(2.128) $\mathbf{P}(\|M_n - D_0\|_F \leq \varepsilon)$

that $M_n$ lies within $\varepsilon$ of $D_0$ in the Frobenius norm (2.64). On the one
hand, the probability density of $M_n$ is proportional to

$$e^{-\operatorname{tr}(D_0^2)/2} = e^{-|\lambda^0|^2/2}$$

near $D_0$ (where we write $\lambda^0 := (\lambda_1^0, \dots, \lambda_n^0)$) and the volume of a
ball of radius $\varepsilon$ in the $n^2$-dimensional space $V_n$ is proportional to
$\varepsilon^{n^2}$, so (2.128) is equal to

(2.129) $(C'_n + o(1)) \varepsilon^{n^2} e^{-\operatorname{tr}(D_0^2)/2}$

for some constant $C'_n > 0$ depending only on $n$, where $o(1)$ goes
to zero as $\varepsilon \to 0$ (keeping $n$ and $D_0$ fixed). On the other hand, if
$\|M_n - D_0\|_F \leq \varepsilon$, then by the Weyl inequality (1.55) (or
Wielandt-Hoffman inequality (1.65)) we have $D = D_0 + O(\varepsilon)$ (we allow
implied constants here to depend on $n$ and on $D_0$). This implies
$UDU^{-1} = D + O(\varepsilon)$, thus $UD - DU = O(\varepsilon)$. As a consequence we see
that the off-diagonal elements of $U$ are of size $O(\varepsilon)$. We can thus use
the inverse function theorem in this local region of parameter space and
make the ansatz⁴⁷

$$D = D_0 + \varepsilon E; \quad U = \exp(\varepsilon S) R$$

where $E$ is a bounded diagonal matrix, $R$ is a diagonal unitary matrix,
and $S$ is a bounded skew-adjoint matrix with zero diagonal. Note that
the map $(R, S) \mapsto \exp(\varepsilon S) R$ has a non-degenerate Jacobian, so the
inverse function theorem applies to uniquely specify $R, S$ (and thus
$E$) from $U, D$ in this local region of parameter space.

⁴⁶ For a comparison of the two approaches, see [Ta2009b, §1.4].

Conversely, if $D, U$ take the above form, then we can Taylor expand
and conclude that

$$M_n = UDU^* = D_0 + \varepsilon E + \varepsilon(S D_0 - D_0 S) + O(\varepsilon^2)$$

and so

$$\|M_n - D_0\|_F = \varepsilon \|E + (S D_0 - D_0 S)\|_F + O(\varepsilon^2).$$

We can thus bound (2.128) from above and below by expressions of
the form

(2.130) $\mathbf{P}(\|E + (S D_0 - D_0 S)\|_F \leq 1 + O(\varepsilon)).$

As $U$ is distributed using Haar measure on $U(n)$, $S$ is (locally)
distributed using $\varepsilon^{n^2 - n}$ times a constant multiple of Lebesgue measure
on the space $W$ of skew-adjoint matrices with zero diagonal, which has
dimension $n^2 - n$. Meanwhile, $E$ is distributed using
$(\rho_n(\lambda^0) + o(1)) \varepsilon^n$ times Lebesgue measure on the space of diagonal
elements. Thus we can rewrite (2.130) as

$$C''_n \varepsilon^{n^2} (\rho_n(\lambda^0) + o(1)) \int\!\!\int_{\|E + (S D_0 - D_0 S)\|_F \leq 1 + O(\varepsilon)} dE\, dS$$

where $dE$ and $dS$ denote Lebesgue measure and $C''_n > 0$ depends only
on $n$.

Observe that the map $S \mapsto S D_0 - D_0 S$ dilates the (complex-valued)
$ij$ entry of $S$ by $\lambda_j^0 - \lambda_i^0$, and so the Jacobian of this map is
$\prod_{1 \leq i < j \leq n} |\lambda_j^0 - \lambda_i^0|^2 = |\Delta_n(\lambda^0)|^2$. Applying the change
of variables, we can express the above as

$$C''_n \varepsilon^{n^2} \frac{\rho_n(\lambda^0) + o(1)}{|\Delta_n(\lambda^0)|^2} \int\!\!\int_{\|E + S\|_F \leq 1 + O(\varepsilon)} dE\, dS.$$

The integral here is of the form $C'''_n + O(\varepsilon)$ for some other constant
$C'''_n > 0$. Comparing this formula with (2.129) we see that

$$\rho_n(\lambda^0) + o(1) = C''''_n e^{-|\lambda^0|^2/2} |\Delta_n(\lambda^0)|^2 + o(1)$$

for yet another constant $C''''_n > 0$. Sending $\varepsilon \to 0$ we recover an exact
formula

$$\rho_n(\lambda) = C''''_n e^{-|\lambda|^2/2} |\Delta_n(\lambda)|^2$$

when $\lambda$ is simple. Since almost all Hermitian matrices have simple
spectrum (see Exercise 1.3.10), this gives the full spectral distribution
of GUE, except for the issue of the unspecified constant.

⁴⁷ Note here the emergence of the freedom to right-multiply $U$ by diagonal
unitaries.
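
The Jacobian step in this argument can be checked by hand on a small case. The Python sketch below, with the hypothetical choices $n = 3$, $\lambda^0 = (3, 1, 0)$ and an arbitrary skew-adjoint $S$ with zero diagonal, verifies that $S \mapsto S D_0 - D_0 S$ multiplies the $ij$ entry by $\lambda_j^0 - \lambda_i^0$. Since each pair $i < j$ contributes two real coordinates (the real and imaginary parts of the entry), the Jacobian is the product of the squared factors. All helper names are illustrative.

```python
from itertools import combinations


def matmul(A, B):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def commutator_map(S, lam):
    """Return S D_0 - D_0 S for D_0 = diag(lam)."""
    n = len(lam)
    D = [[lam[i] if i == j else 0 for j in range(n)] for i in range(n)]
    SD, DS = matmul(S, D), matmul(D, S)
    return [[SD[i][j] - DS[i][j] for j in range(n)] for i in range(n)]


def skew_adjoint(upper, n):
    """Skew-adjoint matrix with zero diagonal, built from its entries above the diagonal."""
    S = [[0j] * n for _ in range(n)]
    for (i, j), z in upper.items():
        S[i][j] = z
        S[j][i] = -z.conjugate()
    return S


lam = [3, 1, 0]                                            # hypothetical simple spectrum
upper = {(0, 1): 1 + 2j, (0, 2): -2 + 1j, (1, 2): 4 - 3j}  # arbitrary entries of S
S = skew_adjoint(upper, 3)
T = commutator_map(S, lam)

# Entrywise dilation: (S D_0 - D_0 S)_{ij} = (lam_j - lam_i) S_{ij}.
dilation_ok = all(T[i][j] == (lam[j] - lam[i]) * S[i][j]
                  for i in range(3) for j in range(3))

# Two real coordinates per pair i < j, each scaled by lam_j - lam_i.
jacobian = 1
vandermonde_sq = 1
for i, j in combinations(range(3), 2):
    jacobian *= (lam[j] - lam[i]) ** 2
    vandermonde_sq *= (lam[i] - lam[j]) ** 2
```

For this example the Jacobian and $|\Delta_3(\lambda^0)|^2$ both equal $2^2 \cdot 3^2 \cdot 1^2 = 36$.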

Remark 2.6.1. In principle, this method should also recover the
explicit normalising constant $\frac{1}{(2\pi)^{n/2}}$ in (2.127), but to do this it appears
one needs to understand the volume of the fundamental domain of
$U(n)$ with respect to the logarithm map, or equivalently to
understand the volume of the unit ball of Hermitian matrices in the
operator norm. I do not know of a simple way to compute this quantity
(though it can be inferred from (2.127) and the above analysis). One
can also recover the normalising constant through the machinery of
determinantal processes, see below.

Remark 2.6.2. The above computation can be generalised to other
$U(n)$-conjugation-invariant ensembles $M_n$ whose probability
distribution is of the form

$$\mu_{M_n} = C_n e^{-\operatorname{tr} V(M_n)}\, dM_n$$

for some potential function $V : \mathbf{R} \to \mathbf{R}$ (where we use the spectral
theorem to define $V(M_n)$), yielding a density function for the
spectrum of the form

$$\rho_n(\lambda) = C'_n e^{-\sum_{j=1}^n V(\lambda_j)} |\Delta_n(\lambda)|^2.$$

Given suitable regularity conditions on $V$, one can then generalise
many of the arguments in this section to such ensembles. See [De1999]
for details.

2.6.2. The spectrum of Gaussian matrices. The above method
also works for Gaussian matrices $G$, as was first observed by Dyson
(though the final formula was first obtained by Ginibre, using a
different method). Here, the density function is given by

(2.131) $C_n e^{-\operatorname{tr}(GG^*)}\, dG = C_n e^{-\|G\|_F^2}\, dG$

where $C_n > 0$ is a constant and $dG$ is Lebesgue measure on the space
$M_n(\mathbf{C})$ of all complex $n \times n$ matrices. This is invariant under both
left and right multiplication by unitary matrices, so in particular is
invariant under unitary conjugations as before.

This matrix $G$ has $n$ complex (generalised) eigenvalues
$\sigma(G) = \{\lambda_1, \dots, \lambda_n\}$, which are usually distinct:

Exercise 2.6.1. Let $n \geq 2$. Show that the space of matrices in
$M_n(\mathbf{C})$ with a repeated eigenvalue has codimension 2.

Unlike the Hermitian situation, though, there is no natural way
to order these $n$ complex eigenvalues. We will thus consider all $n!$
possible permutations at once, and define the spectral density function
$\rho_n(\lambda_1, \dots, \lambda_n)$ of $G$ by duality and the formula

$$\int_{\mathbf{C}^n} F(\lambda_1, \dots, \lambda_n) \rho_n(\lambda_1, \dots, \lambda_n)\, d\lambda_1 \dots d\lambda_n = \mathbf{E} \sum_{\{\lambda_1, \dots, \lambda_n\} = \sigma(G)} F(\lambda_1, \dots, \lambda_n)$$

for all test functions $F$. By the Riesz representation theorem, this
uniquely defines $\rho_n$ (as a distribution, at least), although the total
mass of $\rho_n$ is $n!$ rather than 1 due to the ambiguity in the spectrum.

Now we compute $\rho_n$ (up to constants). In the Hermitian case,
the key was to use the factorisation $M_n = UDU^{-1}$. This particular
factorisation is of course unavailable in the non-Hermitian case.
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However, if the non-Hcrmitian matrix G has simple spectrum, it can 
always be factored instead as G = UTU^ 1 , where U is unitary and T 
is upper triangular. Indeed, if one applies the Gram-Schmidt process 
to the eigenvectors of G and uses the resulting orthonormal basis to 
form U, one easily verifies the desired factorisation. Note that the 
eigenvalues of G are the same as those of T, which in turn are just 
the diagonal entries of T . 

Exercise 2.6.2. Show that this factorisation is also available when
there are repeated eigenvalues. (Hint: use the Jordan normal form.)
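
The factorisation $G = UTU^{-1}$ can be illustrated numerically. The following is a minimal sketch, assuming NumPy is available; it draws a Gaussian matrix, orthonormalises its eigenvectors with a QR factorisation (which plays the role of the Gram-Schmidt process), and checks that the resulting $T$ is upper triangular with the eigenvalues on its diagonal. The seed, the size `n = 5`, and the variable names are illustrative choices, and the eigenvectors are processed in whatever order `numpy.linalg.eig` returns them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# Complex Gaussian matrix with density proportional to exp(-tr(G G^*)):
# real and imaginary parts of each entry are independent N(0, 1/2).
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)

# Columns of V are eigenvectors of G (the spectrum is almost surely simple).
eigvals, V = np.linalg.eig(G)

# QR factorisation orthonormalises the columns of V: V = U R with U unitary.
U, R = np.linalg.qr(V)

# Since G V = V D, we get U^{-1} G U = R D R^{-1}, which is upper triangular.
T = U.conj().T @ G @ U

schur_ok = np.allclose(np.tril(T, k=-1), 0, atol=1e-8)   # T is upper triangular
diag_ok = np.allclose(np.diag(T), eigvals, atol=1e-8)    # diagonal = eigenvalues
recon_ok = np.allclose(U @ T @ U.conj().T, G, atol=1e-8) # G = U T U^{-1}
```

In practice one would call a dedicated Schur routine rather than orthonormalising eigenvectors, which can be ill-conditioned; the sketch follows the argument in the text instead.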

To use this factorisation, we first have to understand how unique
it is, at least in the generic case when there are no repeated
eigenvalues. As noted above, if $G = UTU^{-1}$, then the diagonal
entries of $T$ form the same set as the eigenvalues of $G$. We have the
freedom to conjugate $T$ by a permutation matrix $P$ to obtain
$P^{-1}TP$, and right-multiply $U$ by $P$ to counterbalance this
conjugation; this permutes the diagonal entries of $T$ around in any
one of $n!$ combinations.

Now suppose we fix the diagonal $\lambda_1, \ldots, \lambda_n$ of $T$, which amounts
to picking an ordering of the $n$ eigenvalues of $G$. The eigenvalues of
$T$ are $\lambda_1, \ldots, \lambda_n$, and furthermore for each $1 \le j \le n$, the
eigenvector of $T$ associated to $\lambda_j$ lies in the span of the last
$n-j+1$ basis vectors $e_j, \ldots, e_n$ of $\mathbf{C}^n$, with a non-zero $e_j$
coefficient (as can be seen by Gaussian elimination or Cramer's rule).
As $G = UTU^{-1}$ with $U$ unitary, we conclude that for each
$1 \le j \le n$, the $j$-th column of $U$ lies in the span of the
eigenvectors associated to $\lambda_j, \ldots, \lambda_n$. As these columns are
orthonormal, they must thus arise from applying the Gram-Schmidt
process to these eigenvectors (as discussed earlier). This argument
also shows that once the diagonal entries $\lambda_1, \ldots, \lambda_n$ of $T$ are
fixed, each column of $U$ is determined up to rotation by a unit phase.
In other words, the only remaining freedom is to replace $U$ by $UR$
for some unit diagonal matrix $R$, and then to replace $T$ by
$R^{-1}TR$ to counterbalance this change of $U$.

To summarise, the factorisation $G = UTU^{-1}$ is unique up to
right-multiplying $U$ by permutation matrices and diagonal unitary
matrices (which together generate the Weyl group of the unitary
group $U(n)$), and then conjugating $T$ by the same matrix. Given
a matrix $G$, we may apply these symmetries randomly, ending up
with a random factorisation $UTU^{-1}$ such that the distribution of
$T$ is invariant under conjugation by permutation matrices and
diagonal unitary matrices. Also, since $G$ is itself invariant under
unitary conjugations, we may also assume that $U$ is distributed
uniformly according to the Haar measure of $U(n)$, and independently
of $T$.

To summarise, the Gaussian matrix ensemble $G$ can almost surely
be factorised as $UTU^{-1}$, where $T = (t_{ij})_{1 \le i \le j \le n}$ is an
upper-triangular matrix distributed according to some distribution

$\psi((t_{ij})_{1 \le i \le j \le n}) \prod_{1 \le i \le j \le n} dt_{ij}$

which is invariant with respect to conjugating $T$ by permutation
matrices or diagonal unitary matrices, and $U$ is uniformly
distributed according to the Haar measure of $U(n)$, independently
of $T$.

Now let $T_0 = (t^0_{ij})_{1 \le i \le j \le n}$ be an upper triangular matrix
with complex entries whose diagonal entries
$t^0_{11}, \ldots, t^0_{nn} \in \mathbf{C}$ are distinct. As in the previous section,
we consider the probability

(2.132) $\mathbf{P}(\|G - T_0\|_F \le \varepsilon).$

On the one hand, since the space $M_n(\mathbf{C})$ of complex $n \times n$ matrices
has $2n^2$ real dimensions, we see from (2.131) that this expression is
equal to

(2.133) $(C'_n + o(1)) e^{-\|T_0\|_F^2} \varepsilon^{2n^2}$

for some constant $C'_n > 0$.

Now we compute (2.132) using the factorisation $G = UTU^{-1}$.
Suppose that $\|G - T_0\|_F \le \varepsilon$, so $G = T_0 + O(\varepsilon)$. As the
eigenvalues of $T_0$ are $t^0_{11}, \ldots, t^0_{nn}$, which are assumed to be
distinct, we see (from the inverse function theorem) that for $\varepsilon$
small enough, $G$ has eigenvalues
$t^0_{11} + O(\varepsilon), \ldots, t^0_{nn} + O(\varepsilon)$. Thus the diagonal entries of $T$
are some permutation of $t^0_{11} + O(\varepsilon), \ldots, t^0_{nn} + O(\varepsilon)$. As we are
assuming the distribution of $T$ to be invariant under conjugation by
permutation matrices, all permutations here are equally likely, so
with probability^{48} $1/n!$, we may assume that the diagonal entries
of $T$ are given by $t^0_{11} + O(\varepsilon), \ldots, t^0_{nn} + O(\varepsilon)$ in that order.

^{48} The factor of $1/n!$ will eventually be absorbed into one of the
unspecified constants.

Let $u^0_1, \ldots, u^0_n$ be eigenvectors of $T_0$ associated to
$t^0_{11}, \ldots, t^0_{nn}$; then the Gram-Schmidt process applied to
$u^0_1, \ldots, u^0_n$ (starting at $u^0_n$ and working backwards to $u^0_1$)
gives the standard basis $e_1, \ldots, e_n$ (in reverse order). By the
inverse function theorem, we thus see that we have eigenvectors
$u_1 = u^0_1 + O(\varepsilon), \ldots, u_n = u^0_n + O(\varepsilon)$ of $G$, which when the
Gram-Schmidt process is applied, gives a perturbation
$e_1 + O(\varepsilon), \ldots, e_n + O(\varepsilon)$ in reverse order. This gives a
factorisation $G = UTU^{-1}$ in which $U = I + O(\varepsilon)$, and hence
$T = T_0 + O(\varepsilon)$. This is however not the most general factorisation
available, even after fixing the diagonal entries of $T$, due to the
freedom to right-multiply $U$ by diagonal unitary matrices $R$. We thus
see that the correct ansatz here is to have

$U = R + O(\varepsilon); \quad T = R^{-1} T_0 R + O(\varepsilon)$

for some diagonal unitary matrix $R$.

In analogy with the GUE case, we can use the inverse function
theorem to make the more precise ansatz

$U = \exp(\varepsilon S) R; \quad T = R^{-1}(T_0 + \varepsilon E) R$

where $S$ is skew-Hermitian with zero diagonal and size $O(1)$, $R$ is
diagonal unitary, and $E$ is an upper triangular matrix of size $O(1)$.
From the invariance $U \mapsto UR$; $T \mapsto R^{-1}TR$ we see that $R$ is
distributed uniformly across all diagonal unitaries. Meanwhile, from
the unitary conjugation invariance, $S$ is distributed according to a
constant multiple of $\varepsilon^{n^2-n}$ times Lebesgue measure $dS$ on the
$n^2-n$-dimensional space of skew-Hermitian matrices with zero
diagonal; and from the definition of $\psi$, $E$ is distributed according
to a constant multiple of the measure

$(1 + o(1)) \varepsilon^{n^2+n} \psi(T_0)\, dE,$

where $dE$ is Lebesgue measure on the $n^2+n$-dimensional space of
upper-triangular matrices. Furthermore, the invariances ensure that
the random variables $S$, $R$, $E$ are distributed independently.
Finally, we have

$G = UTU^{-1} = \exp(\varepsilon S)(T_0 + \varepsilon E) \exp(-\varepsilon S).$

Thus we may rewrite (2.132) as

(2.134) $(C''_n \psi(T_0) + o(1)) \varepsilon^{2n^2} \int\int_{\|\exp(\varepsilon S)(T_0 + \varepsilon E)\exp(-\varepsilon S) - T_0\|_F \le \varepsilon} dS\, dE$

for some $C''_n > 0$ (the $R$ integration being absorbable into this
constant $C''_n$). We can Taylor expand

$\exp(\varepsilon S)(T_0 + \varepsilon E)\exp(-\varepsilon S) = T_0 + \varepsilon(E + ST_0 - T_0S) + O(\varepsilon^2)$

and so we can bound (2.134) above and below by expressions of the
form

$(C''_n \psi(T_0) + o(1)) \varepsilon^{2n^2} \int\int_{\|E + ST_0 - T_0S\|_F \le 1 + O(\varepsilon)} dS\, dE.$

The Lebesgue measure $dE$ is invariant under translations by upper
triangular matrices, so we may rewrite the above expression as

(2.135) $(C''_n \psi(T_0) + o(1)) \varepsilon^{2n^2} \int\int_{\|E + \pi(ST_0 - T_0S)\|_F \le 1 + O(\varepsilon)} dS\, dE,$

where $\pi(ST_0 - T_0S)$ is the strictly lower triangular component of
$ST_0 - T_0S$.

The next step is to make the (linear) change of variables
$V := \pi(ST_0 - T_0S)$. We check dimensions: $S$ ranges in the space
$\mathcal{S}$ of skew-Hermitian matrices with zero diagonal, which has
(complex) dimension $(n^2-n)/2$, as does the space of strictly
lower-triangular matrices, which is where $V$ ranges. So we can in
principle make this change of variables, but we first have to compute
the Jacobian of the transformation (and check that it is non-zero).
For this, we switch to coordinates. Write $S = (s_{ij})_{1 \le i,j \le n}$ and
$V = (v_{ij})_{1 \le j < i \le n}$. In coordinates, the equation
$V = \pi(ST_0 - T_0S)$ becomes

$v_{ij} = \sum_{k=1}^{j} s_{ik} t^0_{kj} - \sum_{k=i}^{n} t^0_{ik} s_{kj}$

or equivalently

$v_{ij} = (t^0_{jj} - t^0_{ii}) s_{ij} + \sum_{k=1}^{j-1} t^0_{kj} s_{ik} - \sum_{k=i+1}^{n} t^0_{ik} s_{kj}.$

Thus for instance

$v_{n1} = (t^0_{11} - t^0_{nn}) s_{n1}$

$v_{n2} = (t^0_{22} - t^0_{nn}) s_{n2} + t^0_{12} s_{n1}$

$v_{(n-1)1} = (t^0_{11} - t^0_{(n-1)(n-1)}) s_{(n-1)1} - t^0_{(n-1)n} s_{n1}$

$v_{n3} = (t^0_{33} - t^0_{nn}) s_{n3} + t^0_{13} s_{n1} + t^0_{23} s_{n2}$

$v_{(n-1)2} = (t^0_{22} - t^0_{(n-1)(n-1)}) s_{(n-1)2} + t^0_{12} s_{(n-1)1} - t^0_{(n-1)n} s_{n2}$

$v_{(n-2)1} = (t^0_{11} - t^0_{(n-2)(n-2)}) s_{(n-2)1} - t^0_{(n-2)(n-1)} s_{(n-1)1} - t^0_{(n-2)n} s_{n1}$

etc. We then observe that the transformation matrix from
$s_{n1}, s_{n2}, s_{(n-1)1}, \ldots$ to $v_{n1}, v_{n2}, v_{(n-1)1}, \ldots$ is triangular,
with diagonal entries given by $t^0_{jj} - t^0_{ii}$ for $1 \le j < i \le n$. The
Jacobian of the (complex-linear) map $S \mapsto V$ is thus given by

$|\prod_{1 \le j < i \le n} (t^0_{jj} - t^0_{ii})|^2 = |\Delta(t^0_{11}, \ldots, t^0_{nn})|^2$

which is non-zero by the hypothesis that the $t^0_{11}, \ldots, t^0_{nn}$ are
distinct. We may thus rewrite (2.135) as

$\frac{C''_n \psi(T_0) + o(1)}{|\Delta(t^0_{11}, \ldots, t^0_{nn})|^2} \varepsilon^{2n^2} \int\int_{\|E + V\|_F \le 1 + O(\varepsilon)} dV\, dE$

where $dV$ is Lebesgue measure on strictly lower-triangular matrices.
The integral here is equal to $C'''_n + O(\varepsilon)$ for some constant
$C'''_n$. Comparing this with (2.132), cancelling the factor of
$\varepsilon^{2n^2}$, and sending $\varepsilon \to 0$, we obtain the formula

$\psi((t^0_{ij})_{1 \le i \le j \le n}) = C''''_n |\Delta(t^0_{11}, \ldots, t^0_{nn})|^2 e^{-\|T_0\|_F^2}$

for some constant $C''''_n > 0$. We can expand

$e^{-\|T_0\|_F^2} = \prod_{1 \le i \le j \le n} e^{-|t^0_{ij}|^2}.$

If we integrate out the off-diagonal variables $t^0_{ij}$ for
$1 \le i < j \le n$, we see that the density function for the diagonal
entries $(\lambda_1, \ldots, \lambda_n)$ of $T$ is proportional to

$|\Delta(\lambda_1, \ldots, \lambda_n)|^2 e^{-\sum_{j=1}^n |\lambda_j|^2}.$

Since these entries are a random permutation of the eigenvalues of $G$,
we conclude the Ginibre formula

(2.136) $\rho_n(\lambda_1, \ldots, \lambda_n) = c_n |\Delta(\lambda_1, \ldots, \lambda_n)|^2 e^{-\sum_{j=1}^n |\lambda_j|^2}$

for the joint density of the eigenvalues of a Gaussian random matrix,
where $c_n > 0$ is a constant.
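
The Jacobian computation used in the derivation above (the map $S \mapsto V = \pi(ST_0 - T_0S)$ acting on the strictly lower-triangular entries of $S$) can be checked numerically. The sketch below, assuming NumPy, builds the matrix of this complex-linear map for a random upper triangular $T_0$ and compares its determinant with $\prod_{j<i} (t^0_{jj} - t^0_{ii})$; the size `n = 4`, the seed, and the names `pairs`, `index` and `J` are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
# Random upper triangular T0 with (almost surely) distinct diagonal entries.
T0 = np.triu(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))

# Index the strictly lower triangular positions (i, j) with i > j.
pairs = [(i, j) for i in range(n) for j in range(n) if i > j]
index = {p: m for m, p in enumerate(pairs)}

# Matrix of the complex-linear map (s_ij)_{i>j} -> (v_ij)_{i>j}, where
# v_ij = sum_{k<=j} s_ik t_kj - sum_{k>=i} t_ik s_kj.
J = np.zeros((len(pairs), len(pairs)), dtype=complex)
for (i, j) in pairs:
    for k in range(j + 1):      # entries s_ik with k <= j < i
        J[index[(i, j)], index[(i, k)]] += T0[k, j]
    for k in range(i, n):       # entries s_kj with k >= i > j
        J[index[(i, j)], index[(k, j)]] -= T0[i, k]

det_J = np.linalg.det(J)
d = np.diag(T0)
product_of_gaps = np.prod([d[j] - d[i] for (i, j) in pairs])
```

The complex determinant `det_J` should agree with `product_of_gaps`; the real Jacobian appearing in the text is the squared modulus of this quantity.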

Remark 2.6.3. Given that (2.127) can be derived using Dyson Brownian
motion, it is natural to ask whether (2.136) can be derived by a
similar method. It seems that in order to do this, one needs to
consider a Dyson-like process not just on the eigenvalues
$\lambda_1, \ldots, \lambda_n$, but on the entire triangular matrix $T$ (or more
precisely, on the moduli space formed by quotienting out the action of
conjugation by unitary diagonal matrices). Unfortunately the
computations seem to get somewhat complicated, and we do not present
them here.

2.6.3. Mean field approximation. We can use the formula (2.127) 
for the joint distribution to heuristically derive the semicircular law, 
as follows. 

It is intuitively plausible that the spectrum $(\lambda_1, \ldots, \lambda_n)$ should
concentrate in regions in which $\rho_n(\lambda_1, \ldots, \lambda_n)$ is as large as
possible. So it is now natural to ask how to optimise this function.
Note that the expression in (2.127) is non-negative, and vanishes
whenever two of the $\lambda_i$ collide, or when one or more of the $\lambda_i$ go
off to infinity, so a maximum should exist away from these degenerate
situations.

We may take logarithms and write

(2.137) $-\log \rho_n(\lambda_1, \ldots, \lambda_n) = \sum_{j=1}^n \frac{1}{2} |\lambda_j|^2 + \sum\sum_{i \ne j} \log \frac{1}{|\lambda_i - \lambda_j|} + C$

where $C = C_n$ is a constant whose exact value is not of importance
to us. From a mathematical physics perspective, one can interpret
(2.137) as a Hamiltonian for $n$ particles at positions
$\lambda_1, \ldots, \lambda_n$, subject to a confining harmonic potential (these are
the $\frac{1}{2}|\lambda_j|^2$ terms) and a repulsive logarithmic potential
between particles (these are the $\log \frac{1}{|\lambda_i - \lambda_j|}$ terms).

Our objective is now to find a distribution of $\lambda_1, \ldots, \lambda_n$ that
minimises this expression.

We know from previous sections that the $\lambda_i$ should have magnitude
$O(\sqrt{n})$. Let us then heuristically make a mean field
approximation, in that we approximate the discrete spectral measure
$\frac{1}{n} \sum_{j=1}^n \delta_{\lambda_j/\sqrt{n}}$ by a continuous^{49} probability measure
$\rho(x)\, dx$. Then we can heuristically approximate (2.137) as

$n^2 \left( \int_{\mathbf{R}} \frac{1}{2} x^2 \rho(x)\, dx + \int_{\mathbf{R}} \int_{\mathbf{R}} \log \frac{1}{|x-y|} \rho(x) \rho(y)\, dx\, dy \right) + C'_n$

and so we expect the distribution $\rho$ to minimise the functional

(2.138) $\int_{\mathbf{R}} \frac{1}{2} x^2 \rho(x)\, dx + \int_{\mathbf{R}} \int_{\mathbf{R}} \log \frac{1}{|x-y|} \rho(x) \rho(y)\, dx\, dy.$

One can compute the Euler-Lagrange equations of this functional: 

Exercise 2.6.3. Working formally, and assuming that $\rho$ is a
probability measure that minimises (2.138), argue that

$\frac{1}{2} x^2 + 2 \int_{\mathbf{R}} \log \frac{1}{|x-y|} \rho(y)\, dy = C$

for some constant $C$ and all $x$ in the support of $\rho$. For all $x$
outside of the support, establish the inequality

$\frac{1}{2} x^2 + 2 \int_{\mathbf{R}} \log \frac{1}{|x-y|} \rho(y)\, dy \ge C.$

There are various ways we can solve this equation for $\rho$; we sketch
here a complex-analytic method. Differentiating in $x$, we formally
obtain

$x - 2\, \mathrm{p.v.} \int_{\mathbf{R}} \frac{1}{x-y} \rho(y)\, dy = 0$

on the support of $\rho$. But recall that if we let

$s(z) := \int_{\mathbf{R}} \frac{1}{y-z} \rho(y)\, dy$

be the Stieltjes transform of the probability measure $\rho(x)\, dx$, then
we have

$\mathrm{Im}(s(x + i0^+)) = \pi \rho(x)$

and

$\mathrm{Re}(s(x + i0^+)) = -\mathrm{p.v.} \int_{\mathbf{R}} \frac{1}{x-y} \rho(y)\, dy.$

^{49} Secretly, we know from the semicircular law that we should be
able to take $\rho = \frac{1}{2\pi} (4 - x^2)_+^{1/2}$, but pretend that we do not know
this fact yet.

We conclude that

$(x + 2\, \mathrm{Re}(s(x + i0^+)))\, \mathrm{Im}(s(x + i0^+)) = 0$

for all $x$, which we rearrange as

$\mathrm{Im}(s^2(x + i0^+) + x s(x + i0^+)) = 0.$

This makes the function $f(z) = s^2(z) + zs(z)$ entire (it is analytic
in the upper half-plane, obeys the symmetry $f(\overline{z}) = \overline{f(z)}$, and
has no jump across the real line). On the other hand, as
$s(z) = \frac{-1 + o(1)}{z}$ as $z \to \infty$, $f$ goes to $-1$ at infinity.
Applying Liouville's theorem, we conclude that $f$ is constant, thus we
have the familiar equation

$s^2 + zs = -1$

which can then be solved to obtain the semicircular law as in Section
2.4.
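
As a sanity check on the equation $s^2 + zs = -1$, one can evaluate the Stieltjes transform of the semicircular density $\frac{1}{2\pi}(4 - x^2)_+^{1/2}$ numerically and test the identity at a point of the upper half-plane. The following pure-Python sketch does this with a midpoint rule after the substitution $y = 2\cos\theta$; the function name `stieltjes_semicircle`, the test point `z`, and the number of quadrature points are illustrative assumptions.

```python
import cmath
import math

def stieltjes_semicircle(z, num_points=4000):
    """Approximate s(z) = int rho(y) / (y - z) dy for the semicircular density.

    With y = 2 cos(theta), rho(y) dy becomes (2/pi) sin^2(theta) d(theta),
    and the integral over [0, pi] is evaluated by the midpoint rule.
    """
    h = math.pi / num_points
    total = 0j
    for m in range(num_points):
        theta = (m + 0.5) * h
        total += math.sin(theta) ** 2 / (2.0 * math.cos(theta) - z)
    return (2.0 / math.pi) * total * h

z = 0.3 + 0.5j
s = stieltjes_semicircle(z)
residual = s * s + z * s + 1      # vanishes when s^2 + z s = -1

# The two roots of s^2 + z s + 1 = 0; the Stieltjes transform is the one
# with positive imaginary part when z lies in the upper half-plane.
root = cmath.sqrt(z * z - 4)
candidates = [(-z + root) / 2, (-z - root) / 2]
s_closed = max(candidates, key=lambda w: w.imag)
```

Both `residual` and `s - s_closed` should be zero up to quadrature and rounding error.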

Remark 2.6.4. Recall from Section 3.1 that Dyson Brownian motion 
can be used to derive the formula (2.127). One can then interpret 
the Dyson Brownian motion proof of the semicircular law for GUE 
in Section 2.4 as a rigorous formalisation of the above mean field 
approximation heuristic argument. 

One can perform a similar heuristic analysis for the spectral
measure $\mu_G$ of a random Gaussian matrix, giving a description of the
limiting density:

Exercise 2.6.4. Using heuristic arguments similar to those above,
argue that $\mu_G$ should be close to a continuous probability
distribution $\rho(z)\, dz$ obeying the equation

$|z|^2 + \int_{\mathbf{C}} \log \frac{1}{|z-w|} \rho(w)\, dw = C$

on the support of $\rho$, for some constant $C$, with the inequality

(2.139) $|z|^2 + \int_{\mathbf{C}} \log \frac{1}{|z-w|} \rho(w)\, dw \ge C.$

Using the Newton potential $\frac{1}{2\pi} \log |z|$ for the fundamental solution
of the two-dimensional Laplacian $-\partial_x^2 - \partial_y^2$, conclude
(non-rigorously) that $\rho$ is equal to $\frac{1}{\pi}$ on its support.

Also argue that $\rho$ should be rotationally symmetric. Use (2.139)
and Green's formula to argue why the support of $\rho$ should be simply
connected, and then conclude (again non-rigorously) the circular law

(2.140) $\mu_G \approx \frac{1}{\pi} 1_{|z| \le 1}\, dz.$

We will see more rigorous derivations of the circular law later in 
this text. 
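
The circular law heuristic can be explored by simulation. The sketch below, assuming NumPy, samples one Gaussian matrix with density proportional to $e^{-\operatorname{tr}(GG^*)}$, rescales its eigenvalues by $\sqrt{n}$, and compares the fraction lying in the disk of radius $1/2$ with the value $1/4$ predicted by the uniform law on the unit disk. The size `n = 300`, the seed, and the radii are arbitrary illustrative choices, and a single sample only gives approximate agreement.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
# Entries have independent N(0, 1/2) real and imaginary parts, so E|g_ij|^2 = 1.
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)

# Eigenvalues of G / sqrt(n) should be roughly uniform on the unit disk.
eigs = np.linalg.eigvals(G) / np.sqrt(n)
radii = np.abs(eigs)

frac_half = np.mean(radii <= 0.5)   # uniform law on the disk predicts about 1/4
frac_disk = np.mean(radii <= 1.1)   # nearly all eigenvalues lie near the unit disk
```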

2.6.4. Determinantal form of the GUE spectral distribution. 

In a previous section, we showed (up to constants) that the density
function $\rho_n(\lambda_1, \ldots, \lambda_n)$ for the eigenvalues $\lambda_1 \ge \ldots \ge \lambda_n$ of GUE
was given by the formula (2.127).

As is well known, the Vandermonde determinant $\Delta(\lambda_1, \ldots, \lambda_n)$
that appears in (2.127) can be expressed up to sign as a determinant
of an $n \times n$ matrix, namely the matrix $(\lambda_i^{j-1})_{1 \le i,j \le n}$. Indeed,
this determinant is clearly a polynomial of degree $n(n-1)/2$ in
$\lambda_1, \ldots, \lambda_n$ which vanishes whenever two of the $\lambda_i$ agree, and the
claim then follows from the factor theorem (and inspecting a single
coefficient of the Vandermonde determinant, e.g. the
$\prod_{j=1}^n \lambda_j^{j-1}$ coefficient, to get the sign).
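
The determinant identity just described is easy to test on a small example. The sketch below, assuming NumPy, compares the determinant of $(\lambda_i^{j-1})$ with the product of differences, and the determinant of $(\sum_k \lambda_i^k \lambda_j^k)$ with its square. The sample points in `lam` are arbitrary, and the sign convention $\prod_{i<j} (\lambda_j - \lambda_i)$ is the one that matches `numpy.vander` with increasing powers; the text only fixes $\Delta$ up to sign.

```python
import numpy as np

lam = np.array([0.7, -1.2, 2.5, 0.1])
n = len(lam)

# Matrix (lam_i^{j-1}); increasing=True puts the powers 0..n-1 in the columns.
A = np.vander(lam, N=n, increasing=True)
det_A = np.linalg.det(A)

# Product of differences over pairs i < j.
delta = np.prod([lam[j] - lam[i] for i in range(n) for j in range(i + 1, n)])

# A A^T has entries sum_k lam_i^k lam_j^k, so its determinant is Delta^2.
det_gram = np.linalg.det(A @ A.T)
```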

We can square the above fact (or more precisely, multiply the
above matrix by its adjoint) and conclude that $|\Delta(\lambda_1, \ldots, \lambda_n)|^2$
is the determinant of the matrix

$\left( \sum_{k=0}^{n-1} \lambda_i^k \lambda_j^k \right)_{1 \le i,j \le n}.$

More generally, if $P_0(x), \ldots, P_{n-1}(x)$ are any sequence of
polynomials, in which $P_i(x)$ has degree $i$, then we see from row
operations that the determinant of

$(P_{j-1}(\lambda_i))_{1 \le i,j \le n}$

is a non-zero constant multiple of $\Delta(\lambda_1, \ldots, \lambda_n)$ (with the constant
depending on the leading coefficients of the $P_i$), and so the
determinant of

$\left( \sum_{k=0}^{n-1} P_k(\lambda_i) P_k(\lambda_j) \right)_{1 \le i,j \le n}$

is a non-zero constant multiple of $|\Delta(\lambda_1, \ldots, \lambda_n)|^2$. Comparing this
with (2.127), we obtain the formula

$\rho_n(\lambda) = C \det\left( \sum_{k=0}^{n-1} P_k(\lambda_i) e^{-\lambda_i^2/4} P_k(\lambda_j) e^{-\lambda_j^2/4} \right)_{1 \le i,j \le n}$

for some non-zero constant $C$.

This formula is valid for any choice of polynomials $P_i$ of degree
$i$. But the formula is particularly useful when we set $P_i$ equal
to the (normalised) Hermite polynomials, defined^{50} by applying the
Gram-Schmidt process in $L^2(\mathbf{R})$ to the polynomials $x^i e^{-x^2/4}$ for
$i = 0, \ldots, n-1$ to yield $P_i(x) e^{-x^2/4}$. In that case, the expression

(2.141) $K_n(x,y) := \sum_{k=0}^{n-1} P_k(x) e^{-x^2/4} P_k(y) e^{-y^2/4}$

becomes the integral kernel of the orthogonal projection $\pi_{V_n}$
operator in $L^2(\mathbf{R})$ to the span of the $x^i e^{-x^2/4}$, thus

$\pi_{V_n} f(x) = \int_{\mathbf{R}} K_n(x,y) f(y)\, dy$

for all $f \in L^2(\mathbf{R})$, and so $\rho_n(\lambda)$ is now a constant multiple of

$\det(K_n(\lambda_i, \lambda_j))_{1 \le i,j \le n}.$

The reason for working with orthogonal polynomials is that we
have the trace identity

(2.142) $\int_{\mathbf{R}} K_n(x,x)\, dx = \operatorname{tr}(\pi_{V_n}) = n$

and the reproducing formula

(2.143) $K_n(x,y) = \int_{\mathbf{R}} K_n(x,z) K_n(z,y)\, dz$

which reflects the identity $\pi_{V_n} = \pi_{V_n}^2$. These two formulae have an
important consequence:

^{50} Equivalently, the $P_i$ are the orthogonal polynomials associated
to the measure $e^{-x^2/2}\, dx$.

Lemma 2.6.5 (Determinantal integration formula). Let
$K_n: \mathbf{R} \times \mathbf{R} \to \mathbf{R}$ be any symmetric rapidly decreasing function
obeying (2.142), (2.143). Then for any $k \ge 0$, one has

(2.144) $\int_{\mathbf{R}} \det(K_n(\lambda_i, \lambda_j))_{1 \le i,j \le k+1}\, d\lambda_{k+1} = (n-k) \det(K_n(\lambda_i, \lambda_j))_{1 \le i,j \le k}.$

Remark 2.6.6. This remarkable identity is part of the beautiful 
algebraic theory of determinantal processes, which is discussed further 
in [Ta2010b, §2.6]. 

Proof. We induct on $k$. When $k = 0$ this is just (2.142). Now assume
that $k \ge 1$ and that the claim has already been proven for $k-1$.
We apply cofactor expansion to the bottom row of the determinant
$\det(K_n(\lambda_i, \lambda_j))_{1 \le i,j \le k+1}$. This gives a principal term

(2.145) $\det(K_n(\lambda_i, \lambda_j))_{1 \le i,j \le k} K_n(\lambda_{k+1}, \lambda_{k+1})$

plus a sum of $k$ additional terms, the $l$-th term of which is of the
form

(2.146) $(-1)^{k+1-l} K_n(\lambda_l, \lambda_{k+1}) \det(K_n(\lambda_i, \lambda_j))_{1 \le i \le k;\ 1 \le j \le k+1;\ j \ne l}.$

Using (2.142), the principal term (2.145) gives a contribution of
$n \det(K_n(\lambda_i, \lambda_j))_{1 \le i,j \le k}$ to (2.144). For each nonprincipal term
(2.146), we use the multilinearity of the determinant to absorb the
$K_n(\lambda_l, \lambda_{k+1})$ term into the $j = k+1$ column of the matrix. Using
(2.143), we thus see that the contribution of (2.146) to (2.144) can
be simplified as

$(-1)^{k+1-l} \det((K_n(\lambda_i, \lambda_j))_{1 \le i \le k;\ 1 \le j \le k;\ j \ne l}, (K_n(\lambda_i, \lambda_l))_{1 \le i \le k})$

which after row exchange, simplifies to $-\det(K_n(\lambda_i, \lambda_j))_{1 \le i,j \le k}$.
The claim follows. □
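
The identities (2.142), (2.143) and the case $k = 1$ of (2.144) can be checked by direct numerical integration in the simplest non-trivial case $n = 2$. The sketch below is pure Python and uses the explicit kernel built from $P_0 = (2\pi)^{-1/4}$ and $P_1(x) = x (2\pi)^{-1/4}$, which are orthonormal with respect to $e^{-x^2/2}\, dx$; the function names `K2` and `integrate`, the evaluation points, and the integration window are illustrative assumptions.

```python
import math

def K2(x, y):
    """Kernel (2.141) for n = 2, with P_0 = (2 pi)^{-1/4} and P_1(x) = x P_0."""
    return (1.0 + x * y) * math.exp(-(x * x + y * y) / 4.0) / math.sqrt(2.0 * math.pi)

def integrate(f, half_width=14.0, steps=2800):
    """Midpoint rule on [-half_width, half_width]; the integrands decay rapidly."""
    h = 2.0 * half_width / steps
    return h * sum(f(-half_width + (m + 0.5) * h) for m in range(steps))

n = 2
x, y0 = 0.4, -0.9

trace = integrate(lambda t: K2(t, t))                     # (2.142): equals n
reproduced = integrate(lambda z: K2(x, z) * K2(z, y0))    # (2.143): equals K2(x, y0)

# (2.144) with k = 1: integrate the 2 x 2 determinant over the second variable.
lhs = integrate(lambda t: K2(x, x) * K2(t, t) - K2(x, t) * K2(t, x))
rhs = (n - 1) * K2(x, x)
```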



In particular, if we iterate the above lemma using the
Fubini-Tonelli theorem, we see that

$\int_{\mathbf{R}^n} \det(K_n(\lambda_i, \lambda_j))_{1 \le i,j \le n}\, d\lambda_1 \ldots d\lambda_n = n!.$

On the other hand, if we extend the probability density function
$\rho_n(\lambda_1, \ldots, \lambda_n)$ symmetrically from the Weyl chamber $\mathbf{R}^n_{\ge}$ to all of
$\mathbf{R}^n$, its integral is also $n!$. Since $\det(K_n(\lambda_i, \lambda_j))_{1 \le i,j \le n}$ is
clearly symmetric in the $\lambda_1, \ldots, \lambda_n$, we can thus compare constants
and conclude the Gaudin-Mehta formula [MeGa1960]

$\rho_n(\lambda_1, \ldots, \lambda_n) = \det(K_n(\lambda_i, \lambda_j))_{1 \le i,j \le n}.$

More generally, if we define $\rho_k: \mathbf{R}^k \to \mathbf{R}^+$ to be the function

(2.147) $\rho_k(\lambda_1, \ldots, \lambda_k) = \det(K_n(\lambda_i, \lambda_j))_{1 \le i,j \le k},$

then the above formula shows that $\rho_k$ is the $k$-point correlation
function for the spectrum, in the sense that

(2.148) $\int_{\mathbf{R}^k} \rho_k(\lambda_1, \ldots, \lambda_k) F(\lambda_1, \ldots, \lambda_k)\, d\lambda_1 \ldots d\lambda_k = \mathbf{E} \sum_{1 \le i_1 < \ldots < i_k \le n} F(\lambda_{i_1}(M_n), \ldots, \lambda_{i_k}(M_n))$

for any test function $F: \mathbf{R}^k \to \mathbf{C}$ supported in the region
$\{(x_1, \ldots, x_k): x_1 \le \ldots \le x_k\}$.

In particular, if we set $k = 1$, we obtain the explicit formula

$\mathbf{E} \mu_{M_n} = \frac{1}{n} K_n(x,x)\, dx$

for the expected empirical spectral measure of $M_n$. Equivalently,
after renormalising by $\sqrt{n}$, we have

(2.149) $\mathbf{E} \mu_{M_n/\sqrt{n}} = \frac{1}{\sqrt{n}} K_n(\sqrt{n} x, \sqrt{n} x)\, dx.$

It is thus of interest to understand the kernel $K_n$ better.

To do this, we begin by recalling that the functions $P_i(x) e^{-x^2/4}$
were obtained from $x^i e^{-x^2/4}$ by the Gram-Schmidt process. In
particular, each $P_i(x) e^{-x^2/4}$ is orthogonal to the $x^j e^{-x^2/4}$ for
all $0 \le j < i$. This implies that $x P_i(x) e^{-x^2/4}$ is orthogonal to
$x^j e^{-x^2/4}$ for $0 \le j < i-1$. On the other hand, $x P_i(x)$ is a
polynomial of degree $i+1$, so $x P_i(x) e^{-x^2/4}$ must lie in the span of
$x^j e^{-x^2/4}$ for $0 \le j \le i+1$. Combining the two facts, we see that
$x P_i$ must be a linear combination of $P_{i-1}, P_i, P_{i+1}$, with the
$P_{i+1}$ coefficient being non-trivial. We rewrite this fact in the form

(2.150) $P_{i+1}(x) = (a_i x + b_i) P_i(x) - c_i P_{i-1}(x)$

for some real numbers $a_i, b_i, c_i$ (with $c_0 = 0$). Taking inner
products with $P_{i+1}$ and $P_{i-1}$ we see that

(2.151) $\int_{\mathbf{R}} x P_i(x) P_{i+1}(x) e^{-x^2/2}\, dx = \frac{1}{a_i}$

and

$\int_{\mathbf{R}} x P_i(x) P_{i-1}(x) e^{-x^2/2}\, dx = \frac{c_i}{a_i}$

and so

(2.152) $c_i := \frac{a_i}{a_{i-1}}$

(with the convention $a_{-1} = \infty$).

We will continue the computation of $a_i, b_i, c_i$ later. For now, we
pick two distinct real numbers $x, y$ and consider the Wronskian-type
expression

$P_{i+1}(x) P_i(y) - P_i(x) P_{i+1}(y).$

Using (2.150), (2.152), we can write this as

$a_i (x-y) P_i(x) P_i(y) + \frac{a_i}{a_{i-1}} (P_i(x) P_{i-1}(y) - P_{i-1}(x) P_i(y))$

or in other words

$P_i(x) P_i(y) = \frac{P_{i+1}(x) P_i(y) - P_i(x) P_{i+1}(y)}{a_i (x-y)} - \frac{P_i(x) P_{i-1}(y) - P_{i-1}(x) P_i(y)}{a_{i-1} (x-y)}.$

We telescope this and obtain the Christoffel-Darboux formula for the
kernel (2.141):

(2.153) $K_n(x,y) = \frac{P_n(x) P_{n-1}(y) - P_{n-1}(x) P_n(y)}{a_{n-1} (x-y)} e^{-(x^2+y^2)/4}.$

Sending $y \to x$ using L'Hôpital's rule, we obtain in particular that

(2.154) $K_n(x,x) = \frac{1}{a_{n-1}} (P'_n(x) P_{n-1}(x) - P'_{n-1}(x) P_n(x)) e^{-x^2/2}.$

Inserting this into (2.149), we see that if we want to understand
the expected spectral measure of GUE, we should understand the
asymptotic behaviour of $P_n$ and the associated constants $a_n$. For
this, we need to exploit the specific properties of the Gaussian weight
$e^{-x^2/2}$. In particular, we have the identity

(2.155) $x e^{-x^2/2} = -\frac{d}{dx} e^{-x^2/2}$

so upon integrating (2.151) by parts, we have

$\int_{\mathbf{R}} (P'_i(x) P_{i+1}(x) + P_i(x) P'_{i+1}(x)) e^{-x^2/2}\, dx = \frac{1}{a_i}.$

As $P'_i$ has degree at most $i-1$, the first term vanishes by the
orthonormal nature of the $P_i(x) e^{-x^2/4}$, thus

(2.156) $\int_{\mathbf{R}} P_i(x) P'_{i+1}(x) e^{-x^2/2}\, dx = \frac{1}{a_i}.$

To compute this, let us denote the leading coefficient of $P_i$ as $k_i$.
Then $P'_{i+1}$ is equal to $\frac{(i+1) k_{i+1}}{k_i} P_i$ plus lower-order terms, and
so we have

$\frac{(i+1) k_{i+1}}{k_i} = \frac{1}{a_i}.$

On the other hand, by inspecting the $x^{i+1}$ coefficient of (2.150) we
have

$k_{i+1} = a_i k_i.$

Combining the two formulae (and making the sign convention that
the $k_i$ are always positive), we see that

$a_i = \frac{1}{\sqrt{i+1}}$

and

$k_{i+1} = \frac{k_i}{\sqrt{i+1}}.$

Meanwhile, a direct computation shows that $P_0(x) = k_0 = \frac{1}{(2\pi)^{1/4}}$,
and thus by induction

$k_i := \frac{1}{(2\pi)^{1/4} \sqrt{i!}}.$

A similar method lets us compute the $b_i$. Indeed, taking inner
products of (2.150) with $P_i(x) e^{-x^2/2}$ and using orthonormality we have

$b_i = -a_i \int_{\mathbf{R}} x P_i(x)^2 e^{-x^2/2}\, dx$

which upon integrating by parts using (2.155) gives

$b_i = -2 a_i \int_{\mathbf{R}} P_i(x) P'_i(x) e^{-x^2/2}\, dx.$

As $P'_i$ is of degree strictly less than $i$, the integral vanishes by
orthonormality, thus $b_i = 0$. The identity (2.150) thus becomes the
Hermite recurrence relation

(2.157) $P_{i+1}(x) = \frac{1}{\sqrt{i+1}} x P_i(x) - \frac{\sqrt{i}}{\sqrt{i+1}} P_{i-1}(x).$
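
The recurrence (2.157) gives a practical way to evaluate the normalised Hermite polynomials, and with it one can test both the orthonormality of the $P_i$ with respect to $e^{-x^2/2}\, dx$ and the Christoffel-Darboux formula (2.153) numerically. The following pure-Python sketch does so; the function names, the integration window and step count, and the sample points are illustrative assumptions rather than anything prescribed by the text.

```python
import math

def hermite_values(x, n):
    """Return [P_0(x), ..., P_n(x)] using the recurrence (2.157),
    P_{i+1}(x) = (x P_i(x) - sqrt(i) P_{i-1}(x)) / sqrt(i + 1),
    started from P_0 = (2 pi)^{-1/4}."""
    values = [(2.0 * math.pi) ** -0.25]
    previous = 0.0
    for i in range(n):
        nxt = (x * values[i] - math.sqrt(i) * previous) / math.sqrt(i + 1)
        previous = values[i]
        values.append(nxt)
    return values

def inner_product(i, j, half_width=12.0, steps=4800):
    """Trapezoid-rule approximation of int P_i(x) P_j(x) exp(-x^2/2) dx."""
    h = 2.0 * half_width / steps
    total = 0.0
    for m in range(steps + 1):
        x = -half_width + m * h
        p = hermite_values(x, max(i, j))
        w = 0.5 if m in (0, steps) else 1.0
        total += w * p[i] * p[j] * math.exp(-x * x / 2.0)
    return total * h

def kernel_sum(x, y, n):
    """K_n(x, y) computed from the definition (2.141)."""
    px, py = hermite_values(x, n), hermite_values(y, n)
    weight = math.exp(-(x * x + y * y) / 4.0)
    return weight * sum(px[k] * py[k] for k in range(n))

def kernel_christoffel_darboux(x, y, n):
    """K_n(x, y) computed from (2.153) with a_{n-1} = 1/sqrt(n); needs x != y."""
    px, py = hermite_values(x, n), hermite_values(y, n)
    weight = math.exp(-(x * x + y * y) / 4.0)
    numerator = px[n] * py[n - 1] - px[n - 1] * py[n]
    return math.sqrt(n) * numerator / (x - y) * weight

norm_33 = inner_product(3, 3)       # should be close to 1
cross_24 = inner_product(2, 4)      # should be close to 0
k_direct = kernel_sum(0.3, -1.1, 6)
k_cd = kernel_christoffel_darboux(0.3, -1.1, 6)
```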

Another recurrence relation arises by considering the integral

$\int_{\mathbf{R}} P_j(x) P'_{i+1}(x) e^{-x^2/2}\, dx.$

On the one hand, as $P'_{i+1}$ has degree at most $i$, this integral
vanishes if $j > i$ by orthonormality. On the other hand, integrating by
parts using (2.155), we can write the integral as

$\int_{\mathbf{R}} (x P_j - P'_j)(x) P_{i+1}(x) e^{-x^2/2}\, dx.$

If $j < i$, then $x P_j - P'_j$ has degree less than $i+1$, so the integral
again vanishes. Thus the integral is non-vanishing only when $j = i$.
Using (2.156), we conclude that

(2.158) $P'_{i+1} = \frac{1}{a_i} P_i = \sqrt{i+1}\, P_i.$

We can combine (2.158) with (2.157) to obtain the formula

$\frac{d}{dx} (e^{-x^2/2} P_i(x)) = -\sqrt{i+1}\, e^{-x^2/2} P_{i+1}(x),$

which together with the initial condition $P_0 = \frac{1}{(2\pi)^{1/4}}$ gives the
explicit representation

(2.159) $P_n(x) := \frac{(-1)^n}{(2\pi)^{1/4} \sqrt{n!}} e^{x^2/2} \frac{d^n}{dx^n} e^{-x^2/2}$

for the Hermite polynomials. Thus, for instance, at $x = 0$ one sees
from Taylor expansion that

(2.160) $P_n(0) = \frac{(-1)^{n/2} \sqrt{n!}}{(2\pi)^{1/4} 2^{n/2} (n/2)!}; \quad P'_n(0) = 0$

when $n$ is even, and

(2.161) $P_n(0) = 0; \quad P'_n(0) = \frac{(-1)^{(n-1)/2} (n+1) \sqrt{n!}}{(2\pi)^{1/4} 2^{(n+1)/2} ((n+1)/2)!}$

when $n$ is odd. (The sign in (2.161) is fixed by (2.158) and (2.160),
since $P'_n(0) = \sqrt{n}\, P_{n-1}(0)$.)

In principle, the formula (2.159), together with (2.154), gives us
an explicit description of the kernel $K_n(x,x)$ (and thus of
$\mathbf{E} \mu_{M_n/\sqrt{n}}$, by (2.149)). However, to understand the asymptotic
behaviour as $n \to \infty$, we would have to understand the asymptotic
behaviour of $\frac{d^n}{dx^n} e^{-x^2/2}$ as $n \to \infty$, which is not immediately
discernible by inspection. However, one can obtain such asymptotics by
a variety of means. We give two such methods here: a method based on
ODE analysis, and a complex-analytic method, based on the method of
steepest descent.

We begin with the ODE method. Combining (2.157) with (2.158)
we see that each polynomial $P_m$ obeys the Hermite differential
equation

$P''_m(x) - x P'_m(x) + m P_m(x) = 0.$

If we look instead at the Hermite functions $\phi_m(x) := P_m(x) e^{-x^2/4}$,
we obtain the differential equation

$L \phi_m(x) = (m + \frac{1}{2}) \phi_m$

where $L$ is the harmonic oscillator operator

$L \phi := -\phi'' + \frac{x^2}{4} \phi.$

Note that the self-adjointness of $L$ here is consistent with the
orthogonal nature of the $\phi_m$.

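Exercise 2.6.5. Use (2.141), (2.154), (2.159), (2.157), (2.158) to
establish the identities

$K_n(x,x) = \sum_{j=0}^{n-1} \phi_j(x)^2 = \phi'_n(x)^2 + (n - \frac{x^2}{4}) \phi_n(x)^2$

and thus by (2.149)

$\mathbf{E} \mu_{M_n/\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{j=0}^{n-1} \phi_j(\sqrt{n} x)^2\, dx = \frac{1}{\sqrt{n}} [\phi'_n(\sqrt{n} x)^2 + n (1 - \frac{x^2}{4}) \phi_n(\sqrt{n} x)^2]\, dx.$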

It is thus natural to look at the rescaled functions

$\tilde \phi_m(x) := n^{1/4} \phi_m(\sqrt{n} x)$

which are orthonormal in $L^2(\mathbf{R})$ and solve the equation

$L_{1/n} \tilde \phi_m(x) = \frac{m + 1/2}{n} \tilde \phi_m$

where $L_h$ is the semiclassical harmonic oscillator operator

$L_h \phi := -h^2 \phi'' + \frac{x^2}{4} \phi,$

thus

(2.162) $\mathbf{E} \mu_{M_n/\sqrt{n}} = \frac{1}{n} \sum_{j=0}^{n-1} \tilde \phi_j(x)^2\, dx = [\frac{1}{n^2} \tilde \phi'_n(x)^2 + (1 - \frac{x^2}{4}) \tilde \phi_n(x)^2]\, dx.$

The projection $\pi_{V_n}$ is then the spectral projection operator of
$L_{1/n}$ to $[0,1]$. According to semi-classical analysis, with $h$ being
interpreted as analogous to Planck's constant, the operator $L_h$ has
symbol $p^2 + \frac{x^2}{4}$, where $p := -ih \frac{d}{dx}$ is the momentum operator, so
the projection $\pi_{V_n}$ is a projection to the region
$\{(x,p): p^2 + \frac{x^2}{4} \le 1\}$ of phase space, or equivalently to the
region $\{(x,p): |p| \le \frac{1}{2} (4 - x^2)_+^{1/2}\}$. In the semi-classical limit
$h \to 0$, we thus expect the diagonal $K_n(x,x)$ of the normalised
projection $h \pi_{V_n}$ to be proportional to the projection of this
region to the $x$ variable, i.e. proportional to $(4 - x^2)_+^{1/2}$. We are
thus led to the semicircular law via semi-classical analysis.
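
The prediction of the semi-classical heuristic can be compared with the exact finite-$n$ formula (2.149). The pure-Python sketch below evaluates $\frac{1}{\sqrt{n}} K_n(\sqrt{n} x, \sqrt{n} x)$ by summing squares of Hermite functions generated by the recurrence (2.157), and compares the expected spectral mass of $[-1,1]$ with the semicircular value. The choice `n = 40`, the interval, the step counts, and the function names `density` and `mass` are illustrative assumptions; agreement is only expected up to an error that shrinks as $n$ grows.

```python
import math

def density(x, n):
    """Evaluate (1/sqrt(n)) K_n(sqrt(n) x, sqrt(n) x) by summing phi_j(y)^2,
    where phi_j(y) = P_j(y) exp(-y^2/4) and y = sqrt(n) x."""
    y = math.sqrt(n) * x
    phi_prev = 0.0
    phi = (2.0 * math.pi) ** -0.25 * math.exp(-y * y / 4.0)
    total = 0.0
    for j in range(n):
        total += phi * phi
        # Recurrence (2.157), which the Hermite functions inherit from the P_j.
        phi, phi_prev = (y * phi - math.sqrt(j) * phi_prev) / math.sqrt(j + 1), phi
    return total / math.sqrt(n)

def mass(a, b, n, steps=2000):
    """Midpoint-rule approximation of the expected spectral mass of [a, b]."""
    h = (b - a) / steps
    return h * sum(density(a + (m + 0.5) * h, n) for m in range(steps))

n = 40
mass_gue = mass(-1.0, 1.0, n)
# Semicircular mass of [-1, 1]: (1 / (2 pi)) * int_{-1}^{1} sqrt(4 - x^2) dx.
mass_semicircle = math.sqrt(3.0) / (2.0 * math.pi) + 1.0 / 3.0
total_mass = mass(-4.0, 4.0, n)     # the density should integrate to about 1
```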

It is possible to make the above argument rigorous, but this would
require developing the theory of microlocal analysis, which would be
overkill given that we are just dealing with an ODE rather than a
PDE here (and an extremely classical ODE at that); but see Section
3.3. We instead use a more basic semiclassical approximation, the
WKB approximation, which we will make rigorous using the classical
method of variation of parameters (one could also proceed using the
closely related Prüfer transformation, which we will not detail here).
We study the eigenfunction equation

$L_h \phi = \lambda \phi$

where we think of $h > 0$ as being small, and $\lambda$ as being close to 1.
We rewrite this as

(2.163) $\phi'' = -\frac{1}{h^2} k(x)^2 \phi$

where $k(x) := \sqrt{\lambda - x^2/4}$, where we will only work in the
"classical" region $x^2/4 < \lambda$ (so $k(x) > 0$) for now.

Recall that the general solution to the constant coefficient ODE
$\phi'' = -\frac{1}{h^2} k^2 \phi$ is given by $\phi(x) = A e^{ikx/h} + B e^{-ikx/h}$.
Inspired by this, we make the ansatz

$\phi(x) = A(x) e^{i\Psi(x)/h} + B(x) e^{-i\Psi(x)/h}$

where $\Psi(x) := \int_0^x k(y)\, dy$ is the antiderivative of $k$.
Differentiating this, we have

$\phi'(x) = \frac{ik(x)}{h} (A(x) e^{i\Psi(x)/h} - B(x) e^{-i\Psi(x)/h}) + A'(x) e^{i\Psi(x)/h} + B'(x) e^{-i\Psi(x)/h}.$

Because we are representing a single function $\phi$ by two functions
$A, B$, we have the freedom to place an additional constraint on $A, B$.
Following the usual variation of parameters strategy, we will use this
freedom to eliminate the last two terms in the expansion of $\phi'$, thus

(2.164) $A'(x) e^{i\Psi(x)/h} + B'(x) e^{-i\Psi(x)/h} = 0.$

We can now differentiate again and obtain

$\phi''(x) = -\frac{k(x)^2}{h^2} \phi(x) + \frac{ik'(x)}{h} (A(x) e^{i\Psi(x)/h} - B(x) e^{-i\Psi(x)/h}) + \frac{ik(x)}{h} (A'(x) e^{i\Psi(x)/h} - B'(x) e^{-i\Psi(x)/h}).$

Comparing this with (2.163) we see that

$A'(x) e^{i\Psi(x)/h} - B'(x) e^{-i\Psi(x)/h} = -\frac{k'(x)}{k(x)} (A(x) e^{i\Psi(x)/h} - B(x) e^{-i\Psi(x)/h}).$

Combining this with (2.164), we obtain equations of motion for $A$ and
$B$:

$A'(x) = -\frac{k'(x)}{2k(x)} A(x) + \frac{k'(x)}{2k(x)} B(x) e^{-2i\Psi(x)/h}$

$B'(x) = -\frac{k'(x)}{2k(x)} B(x) + \frac{k'(x)}{2k(x)} A(x) e^{2i\Psi(x)/h}.$

We can simplify this using the integrating factor substitution

$A(x) = k(x)^{-1/2} a(x); \quad B(x) = k(x)^{-1/2} b(x)$

to obtain

(2.165) $a'(x) = \frac{k'(x)}{2k(x)} b(x) e^{-2i\Psi(x)/h};$

(2.166) $b'(x) = \frac{k'(x)}{2k(x)} a(x) e^{2i\Psi(x)/h}.$

The point of doing all these transformations is that the role of the
$h$ parameter no longer manifests itself through amplitude factors,
and instead only is present in a phase factor. In particular, we have

$a', b' = O(|a| + |b|)$

on any compact interval $I$ in the interior of the classical region
$x^2/4 < \lambda$ (where we allow implied constants to depend on $I$), which
by Gronwall's inequality gives the bounds

$a'(x), b'(x), a(x), b(x) = O(|a(0)| + |b(0)|)$

on this interval $I$. We can then insert these bounds into (2.165),
(2.166) again and integrate by parts (taking advantage of the
non-stationary nature of $\Psi$) to obtain the improved bounds^{51}

(2.167) $a(x) = a(0) + O(h(|a(0)| + |b(0)|)); \quad b(x) = b(0) + O(h(|a(0)| + |b(0)|))$

on this interval. This is already enough to get the asymptotics that
we need:

^{51} More precise asymptotic expansions can be obtained by iterating
this procedure, but we will not need them here.

Exercise 2.6.6. Use (2.162) to show that on any compact interval $I$
in $(-2, 2)$, the density of $\mathbf{E} \mu_{M_n/\sqrt{n}}$ is given by

$(|a|^2(x) + |b|^2(x)) (\sqrt{1 - x^2/4} + o(1)) + O(|a(x)| |b(x)|)$

where $a, b$ are as above with $\lambda = 1 + \frac{1}{2n}$ and $h = \frac{1}{n}$. Combining
this with (2.167), (2.160), (2.161), and Stirling's formula, conclude
that $\mathbf{E} \mu_{M_n/\sqrt{n}}$ converges in the vague topology to the semicircular
law $\frac{1}{2\pi} (4 - x^2)_+^{1/2}\, dx$. (Note that once one gets convergence inside
$(-2,2)$, the convergence outside of $[-2,2]$ can be obtained for free
since $\mu_{M_n/\sqrt{n}}$ and $\frac{1}{2\pi} (4 - x^2)_+^{1/2}\, dx$ are both probability
measures.)

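We now sketch out the approach using the method of steepest
descent. The starting point is the Fourier inversion formula

$e^{-x^2/2} = \frac{1}{\sqrt{2\pi}} \int_{\mathbf{R}} e^{itx} e^{-t^2/2}\, dt$

which upon repeated differentiation gives

$\frac{d^n}{dx^n} e^{-x^2/2} = \frac{i^n}{\sqrt{2\pi}} \int_{\mathbf{R}} t^n e^{itx} e^{-t^2/2}\, dt$

and thus by (2.159)

$P_n(x) = \frac{(-i)^n}{(2\pi)^{3/4} \sqrt{n!}} \int_{\mathbf{R}} t^n e^{-(t-ix)^2/2}\, dt$

and thus

$\tilde \phi_n(x) = \frac{(-i)^n}{(2\pi)^{3/4} \sqrt{n!}} n^{(n+1)/2 + 1/4} \int_{\mathbf{R}} e^{n \phi(t)}\, dt$

where

$\phi(t) := \log t - (t - ix)^2/2 - x^2/4$

where we use a suitable branch of the complex logarithm to handle
the case of negative $t$.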

The idea of the principle of steepest descent is to shift the contour
of integration to where the real part of $\phi(z)$ is as small as possible.
For this, it turns out that the stationary points of $\phi(z)$ play a crucial
role. A brief calculation using the quadratic formula shows that there
are two such stationary points, at

$$z = \frac{ix \pm \sqrt{4 - x^2}}{2}.$$






When $|x| < 2$, $\phi$ is purely imaginary at these stationary points, while
for $|x| > 2$ the real part of $\phi$ is negative at both points. One then
draws a contour through these two stationary points in such a way
that near each such point, the imaginary part of $\phi(z)$ is kept fixed,
which keeps oscillation to a minimum and allows the real part to decay
as steeply as possible (which explains the name of the method). After
a certain tedious amount of computation, one obtains the same type
of asymptotics for $\phi_n$ that were obtained by the ODE method when
$|x| < 2$ (and exponentially decaying estimates for $|x| > 2$).

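Exercise 2.6.7. Let $f: \mathbf{C} \to \mathbf{C}$, $g: \mathbf{C} \to \mathbf{C}$ be functions which are
analytic near a complex number $z_0$, with $f'(z_0) = 0$ and $f''(z_0) \neq 0$.
Let $\varepsilon > 0$ be a small number, and let $\gamma$ be the line segment
$\{z_0 + tv : -\varepsilon < t < \varepsilon\}$, where $v$ is a complex phase such that
$f''(z_0)v^2$ is a negative real. Show that for $\varepsilon$ sufficiently small, one has

$$\int_\gamma e^{\lambda f(z)}g(z)\,dz = (1 + o(1))\frac{\sqrt{2\pi}\,v}{\sqrt{|f''(z_0)|\lambda}}\,e^{\lambda f(z_0)}g(z_0)$$

as $\lambda \to +\infty$. This is the basic estimate behind the method of steepest
descent; readers who are also familiar with the method of stationary
phase may see a close parallel.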
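
As a purely numerical sanity check of the estimate in Exercise 2.6.7 (not a proof), the following sketch compares a Simpson-rule approximation of the contour integral with the predicted main term. The test case $f(z) = z^2$, $g(z) = 1 + z$, $z_0 = 0$, $v = i$, and the helper names `contour_integral` and `steepest_descent_prediction`, are illustrative assumptions and are not taken from the text.

```python
import cmath
import math


def contour_integral(f, g, z0, v, eps, lam, steps=20000):
    """Simpson's rule for the integral of exp(lam*f(z))*g(z) dz over the
    segment {z0 + t*v : -eps < t < eps}; `steps` must be even."""
    h = 2.0 * eps / steps
    total = 0j
    for k in range(steps + 1):
        t = -eps + k * h
        z = z0 + t * v
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        total += weight * cmath.exp(lam * f(z)) * g(z)
    return total * v * h / 3.0  # dz = v dt


def steepest_descent_prediction(f_z0, f2_z0, g_z0, v, lam):
    """Main term sqrt(2 pi) v / sqrt(|f''(z0)| lam) * exp(lam f(z0)) * g(z0)."""
    return (math.sqrt(2.0 * math.pi) * v / math.sqrt(abs(f2_z0) * lam)
            * cmath.exp(lam * f_z0) * g_z0)


# Hypothetical test case: f(z) = z^2, so f'(0) = 0 and f''(0) = 2;
# the phase v = i makes f''(0) * v^2 = -2 a negative real.
f = lambda z: z * z
g = lambda z: 1 + z
v = 1j

for lam in (10.0, 100.0, 1000.0):
    numeric = contour_integral(f, g, 0j, v, 0.5, lam)
    predicted = steepest_descent_prediction(0.0, 2.0, 1.0, v, lam)
    print(lam, abs(numeric / predicted - 1))
```

In this example the relative error printed in the last column should shrink as $\lambda$ grows, consistent with the $1 + o(1)$ factor in the exercise.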

Remark 2.6.7. The method of steepest descent requires an explicit
representation of the orthogonal polynomials as contour integrals, and
as such is largely restricted to the classical orthogonal polynomials
(such as the Hermite polynomials). However, there is a non-linear
generalisation of the method of steepest descent developed by Deift
and Zhou, in which one solves a matrix Riemann-Hilbert problem
rather than a contour integral; see [De1999] for details. Using these
sorts of tools, one can generalise much of the above theory to the
spectral distribution of the $U(n)$-conjugation-invariant ensembles
discussed in Remark 2.6.2, with the theory of Hermite polynomials being
replaced by the more general theory of orthogonal polynomials; this is
discussed in [De1999] or [DeGi2007].

The computations performed above for the diagonal kernel $K_n(x,x)$
can be summarised by the asymptotic

$$K_n(\sqrt{n}x, \sqrt{n}x) = \sqrt{n}(\rho_{sc}(x) + o(1))$$

whenever $x \in \mathbf{R}$ is fixed and $n \to \infty$, and
$\rho_{sc}(x) := \frac{1}{2\pi}(4 - x^2)_+^{1/2}$ is the semicircular law distribution.
It is reasonably straightforward to generalise these asymptotics to the
off-diagonal case as well, obtaining the more general result

(2.168)  $$K_n\left(\sqrt{n}x + \frac{y_1}{\sqrt{n}\rho_{sc}(x)}, \sqrt{n}x + \frac{y_2}{\sqrt{n}\rho_{sc}(x)}\right) = \sqrt{n}(\rho_{sc}(x)K(y_1,y_2) + o(1))$$

for fixed $x \in (-2,2)$ and $y_1, y_2 \in \mathbf{R}$, where $K$ is the Dyson sine kernel

$$K(y_1,y_2) := \frac{\sin(\pi(y_1 - y_2))}{\pi(y_1 - y_2)}.$$

In the language of semi-classical analysis, what is going on here is
that the rescaling in the left-hand side of (2.168) is transforming the
phase space region $\{(x,p) : p^2 + \frac{x^2}{4} \le 1\}$ to the region $\{(x,p) : |p| \le 1\}$
in the limit $n \to \infty$, and the projection to the latter region is given
by the Dyson sine kernel. A formal proof of (2.168) can be given
by using either the ODE method or the steepest descent method
to obtain asymptotics for Hermite polynomials, and thence (via the
Christoffel-Darboux formula) to asymptotics for $K_n$; we do not give
the details here, but see for instance [AnGuZi2010].

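From (2.168) and (2.147), (2.148) we obtain the asymptotic formula

$$\mathbf{E}\sum_{1 \le i_1 < \cdots < i_k \le n} F\left(\sqrt{n}\rho_{sc}(x)(\lambda_{i_1}(M_n) - \sqrt{n}x), \ldots, \sqrt{n}\rho_{sc}(x)(\lambda_{i_k}(M_n) - \sqrt{n}x)\right)$$
$$\to \int_{\mathbf{R}^k} F(y_1, \ldots, y_k)\det(K(y_i, y_j))_{1 \le i,j \le k}\,dy_1 \ldots dy_k$$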

for the local statistics of eigenvalues. By means of further algebraic
manipulations (using the general theory of determinantal processes),
this allows one to control such quantities as the distribution of
eigenvalue gaps near $\sqrt{n}x$, normalised at the scale
$\frac{1}{\sqrt{n}\rho_{sc}(x)}$, which is the average size of these gaps as predicted
by the semicircular law. For instance, for any $s_0 > 0$, one can show
(basically by the above formulae combined with the inclusion-exclusion
principle) that the proportion of eigenvalues $\lambda_i$ with normalised gap
$\sqrt{n}\rho_{sc}(t_{i/n})(\lambda_{i+1} - \lambda_i)$ less than $s_0$ converges as $n \to \infty$ to
$\int_0^{s_0}\frac{d^2}{ds^2}\det(1 - K)_{L^2[0,s]}\,ds$, where $t_c \in [-2,2]$ is defined by the
formula $\int_{-2}^{t_c}\rho_{sc}(x)\,dx = c$, and $K$ is the integral operator with
kernel $K(x,y)$ (this operator can be verified to be trace class, so the
determinant can be defined in a Fredholm sense). See for
instance^{52} [Me2004].

Remark 2.6.8. One can also analyse the distribution of the eigenvalues
at the edge of the spectrum, i.e. close to $\pm 2\sqrt{n}$. This ultimately
hinges on understanding the behaviour of the projection $\pi_{V_n}$ near the
corners $(0, \pm 2)$ of the phase space region $\Omega = \{(p,x) : p^2 + \frac{x^2}{4} \le 1\}$, or
of the Hermite polynomials $P_n(x)$ for $x$ close to $\pm 2\sqrt{n}$. For instance,
by using steepest descent methods, one can show that

$$n^{1/12}\phi_n\left(2\sqrt{n} + \frac{x}{n^{1/6}}\right) \to \mathrm{Ai}(x)$$

as $n \to \infty$ for any fixed $x$, where $\mathrm{Ai}$ is the Airy function

$$\mathrm{Ai}(x) := \frac{1}{\pi}\int_0^\infty \cos\left(\frac{t^3}{3} + tx\right)dt.$$

This asymptotic and the Christoffel-Darboux formula then give the
asymptotic

(2.169)  $$n^{-1/6}K_n\left(2\sqrt{n} + \frac{x}{n^{1/6}}, 2\sqrt{n} + \frac{y}{n^{1/6}}\right) \to K_{\mathrm{Ai}}(x,y)$$

for any fixed $x, y$, where $K_{\mathrm{Ai}}$ is the Airy kernel

$$K_{\mathrm{Ai}}(x,y) := \frac{\mathrm{Ai}(x)\mathrm{Ai}'(y) - \mathrm{Ai}'(x)\mathrm{Ai}(y)}{x - y}.$$

This then gives an asymptotic description of the largest eigenvalues
of a GUE matrix, which cluster in the region $2\sqrt{n} + O(n^{-1/6})$. For
instance, one can use the above asymptotics to show that the largest
eigenvalue $\lambda_1$ of a GUE matrix obeys the Tracy-Widom law

$$\mathbf{P}\left(\frac{\lambda_1 - 2\sqrt{n}}{n^{-1/6}} < t\right) \to \det(1 - A)_{L^2([t,+\infty))}$$

for any fixed $t$, where $A$ is the integral operator with kernel $K_{\mathrm{Ai}}$. See
[AnGuZi2010] and Section 3.3 for further discussion.



^{52} A finitary version of this inclusion-exclusion argument can also be
found at [Ta2010b, §2.6].

2.6.5. Determinantal form of the Gaussian matrix distribution.
One can perform an analogous analysis of the joint distribution
function (2.136) of Gaussian random matrices. Indeed, given
any family $P_0(z), \ldots, P_{n-1}(z)$ of polynomials, with each $P_i$ of degree $i$,
much the same arguments as before show that (2.136) is equal to a
constant multiple of

$$\det\left(\sum_{k=0}^{n-1} P_k(\lambda_i)e^{-|\lambda_i|^2/2}\,\overline{P_k(\lambda_j)}e^{-|\lambda_j|^2/2}\right)_{1 \le i,j \le n}.$$

One can then select $P_k(z)e^{-|z|^2/2}$ to be orthonormal in $L^2(\mathbf{C})$. Actually
in this case, the polynomials are very simple, being given explicitly
by the formula

$$P_k(z) := \frac{1}{\sqrt{\pi k!}}z^k.$$

Exercise 2.6.8. Verify that the $P_k(z)e^{-|z|^2/2}$ are indeed orthonormal,
and then conclude that (2.136) is equal to $\det(K_n(\lambda_i, \lambda_j))_{1 \le i,j \le n}$,
where

$$K_n(z,w) := \frac{1}{\pi}e^{-(|z|^2 + |w|^2)/2}\sum_{k=0}^{n-1}\frac{(z\overline{w})^k}{k!}.$$

Conclude further that the $m$-point correlation functions $\rho_m(z_1, \ldots, z_m)$
are given as

$$\rho_m(z_1, \ldots, z_m) = \det(K_n(z_i, z_j))_{1 \le i,j \le m}.$$

Exercise 2.6.9. Show that as $n \to \infty$, one has

$$K_n(\sqrt{n}z, \sqrt{n}z) = \frac{1}{\pi}1_{|z| \le 1} + o(1)$$

and deduce that the expected spectral measure $\mathbf{E}\mu_{G/\sqrt{n}}$ converges
vaguely to the circular measure $\mu_c := \frac{1}{\pi}1_{|z| \le 1}\,dz$; this is a special
case of the circular law.

Exercise 2.6.10. For any $|z| < 1$ and $w_1, w_2 \in \mathbf{C}$, show that

$$K_n(\sqrt{n}(z + w_1), \sqrt{n}(z + w_2)) = \frac{1}{\pi}\exp(-|w_1 - w_2|^2/2) + o(1)$$

as $n \to \infty$. This formula (in principle, at least) describes the
asymptotic local $m$-point correlation functions of the spectrum of
Gaussian matrices.

Remark 2.6.9. One can use the above formulae as the starting point
for many other computations on the spectrum of random Gaussian
matrices; to give just one example, one can show that the expected
number of eigenvalues which are real is of the order of $\sqrt{n}$ (see
[Ed1996] for more precise results of this nature). It remains a
challenge to extend these results to more general ensembles than the
Gaussian ensemble.

2.7. The least singular value 

Now we turn attention to another important spectral statistic, the
least singular value $\sigma_n(M)$ of an $n \times n$ matrix $M$ (or, more generally,
the least non-trivial singular value $\sigma_p(M)$ of an $n \times p$ matrix with
$p \le n$). This quantity controls the invertibility of $M$. Indeed, $M$ is
invertible precisely when $\sigma_n(M)$ is non-zero, and the operator norm
$\|M^{-1}\|_{op}$ of $M^{-1}$ is given by $1/\sigma_n(M)$. This quantity is also related
to the condition number $\sigma_1(M)/\sigma_n(M) = \|M\|_{op}\|M^{-1}\|_{op}$ of $M$,
which is of importance in numerical linear algebra. As we shall see in
Section 2.8, the least singular value of $M$ (and more generally, of the
shifts $\frac{1}{\sqrt{n}}M - zI$ for complex $z$) will be of importance in rigorously
establishing the circular law for iid random matrices $M$, as it plays
a key role in computing the Stieltjes transform $\frac{1}{n}\mathrm{tr}(\frac{1}{\sqrt{n}}M - zI)^{-1}$ of
such matrices, which as we have already seen is a powerful tool in
understanding the spectra of random matrices.

The least singular value

$$\sigma_n(M) = \inf_{\|x\| = 1}\|Mx\|,$$

which sits at the "hard edge" of the spectrum, bears a superficial
similarity to the operator norm

$$\|M\|_{op} = \sigma_1(M) = \sup_{\|x\| = 1}\|Mx\|$$

at the "soft edge" of the spectrum, that was discussed back in Section
2.3, so one may at first think that the methods that were effective
in controlling the latter, namely the epsilon-net argument and the
moment method, would also work to control the former. The epsilon-net
method does indeed have some effectiveness when dealing with
rectangular matrices (in which the spectrum stays well away from
zero), but the situation becomes more delicate for square matrices;
it can control some "low entropy" portions of the infimum that arise
from "structured" or "compressible" choices of $x$, but is not able to
control the "generic" or "incompressible" choices of $x$, for which new
arguments will be needed. As for the moment method, this can give
the coarse order of magnitude (for instance, for rectangular matrices
with $p = yn$ for $0 < y < 1$, it gives an upper bound of
$(1 - \sqrt{y} + o(1))\sqrt{n}$ for the least singular value with high
probability, thanks to the Marchenko-Pastur law), but again this method
begins to break down for square matrices, although one can make some
partial headway by considering negative moments such as $\mathrm{tr}\,M^{-2}$,
though these are more difficult to compute than positive moments
$\mathrm{tr}\,M^k$.

So one needs to supplement these existing methods with additional
tools. It turns out that the key issue is to understand the
distance between one of the $n$ rows $X_1, \ldots, X_n \in \mathbf{C}^n$ of the matrix
$M$, and the hyperplane spanned by the other $n - 1$ rows. The reason
for this is as follows. First suppose that $\sigma_n(M) = 0$, so that $M$
is non-invertible, and there is a linear dependence between the rows
$X_1, \ldots, X_n$. Thus, one of the $X_i$ will lie in the hyperplane spanned
by the other rows, and so one of the distances mentioned above will
vanish; in fact, one expects many of the $n$ distances to vanish.
Conversely, whenever one of these distances vanishes, one has a linear
dependence, and so $\sigma_n(M) = 0$.

More generally, if the least singular value $\sigma_n(M)$ is small, one
generically expects many of these $n$ distances to be small also, and
conversely. Thus, control of the least singular value is morally
equivalent to control of the distance between a row $X_i$ and the
hyperplane spanned by the other rows. This latter quantity is basically
the dot product of $X_i$ with a unit normal $n_i$ of this hyperplane.

When working with random matrices with jointly independent
coefficients, we have the crucial property that the unit normal $n_i$
(which depends on all the rows other than $X_i$) is independent of $X_i$,
so even after conditioning $n_i$ to be fixed, the entries of $X_i$ remain
independent. As such, the dot product $X_i \cdot n_i$ is a familiar scalar
random walk, and can be controlled by a number of tools, most notably
Littlewood-Offord theorems and the Berry-Esseen central limit
theorem. As it turns out, this type of control works well except in
some rare cases in which the normal $n_i$ is "compressible" or otherwise
highly structured; but epsilon-net arguments can be used to dispose
of these cases^{53}.

These methods rely quite strongly on the joint independence of
all the entries; it remains a challenge to extend them to more general
settings. Even for Wigner matrices, the methods run into difficulty 
because of the non-independence of some of the entries (although it 
turns out one can understand the least singular value in such cases 
by rather different methods). 

To simplify the exposition, we shall focus primarily on just one
specific ensemble of random matrices, the Bernoulli ensemble
$M = (\xi_{ij})_{1 \le i,j \le n}$ of random sign matrices, where $\xi_{ij} = \pm 1$ are
independent Bernoulli signs. However, the results can extend to more
general classes of random matrices, with the main requirement being that
the coefficients are jointly independent.

2.7.1. The epsilon-net argument. We begin by using the epsilon
net argument to establish a lower bound in the rectangular case, first
established in [LiPaRuTo2005]:

Theorem 2.7.1 (Lower bound). Let $M = (\xi_{ij})_{1 \le i \le n; 1 \le j \le p}$ be an
$n \times p$ Bernoulli matrix, where $1 \le p \le (1 - \delta)n$ for some $\delta > 0$
(independent of $n$). Then with exponentially high probability (i.e.
$1 - O(e^{-cn})$ for some $c > 0$), one has $\sigma_p(M) \ge c\sqrt{n}$, where $c > 0$
depends only on $\delta$.

This should be compared with the upper bound established in
Section 2.3, which asserts that

(2.170)  $$\|M\|_{op} = \sigma_1(M) \le C\sqrt{n}$$

holds with overwhelming probability for some absolute constant $C$
(indeed, one can take any $C > 2$ here).

We use the epsilon net argument introduced in Section 2.3, but
with a smaller value of $\varepsilon > 0$ than used for the largest singular value.



^{53} This general strategy was first developed for the technically simpler
singularity problem in [Ko1967], and then extended to the least singular
value problem in [Ru2008].

We write

$$\sigma_p(M) = \inf_{x \in \mathbf{C}^p : \|x\| = 1}\|Mx\|.$$

Taking $\Sigma$ to be a maximal $\varepsilon$-net of the unit sphere in $\mathbf{C}^p$, with $\varepsilon > 0$
to be chosen later, we have that

$$\sigma_p(M) \ge \inf_{x \in \Sigma}\|Mx\| - \varepsilon\|M\|_{op}$$

and thus by (2.170), we have with overwhelming probability that

$$\sigma_p(M) \ge \inf_{x \in \Sigma}\|Mx\| - C\varepsilon\sqrt{n},$$

and so it suffices to show that

$$\mathbf{P}\left(\inf_{x \in \Sigma}\|Mx\| \le 2C\varepsilon\sqrt{n}\right)$$

is exponentially small in $n$. From the union bound, we can upper
bound this by

$$\sum_{x \in \Sigma}\mathbf{P}(\|Mx\| \le 2C\varepsilon\sqrt{n}).$$

From the volume packing argument we have

(2.171)  $$|\Sigma| \le O(1/\varepsilon)^p \le O(1/\varepsilon)^{(1-\delta)n}.$$

So we need to upper bound, for each $x \in \Sigma$, the probability

$$\mathbf{P}(\|Mx\| \le 2C\varepsilon\sqrt{n}).$$

If we let $Y_1, \ldots, Y_n \in \mathbf{C}^p$ be the rows of $M$, we can write this as

$$\mathbf{P}\left(\sum_{j=1}^n |Y_j \cdot x|^2 \le 4C^2\varepsilon^2 n\right).$$

By Markov's inequality (1.14), the only way that this event can hold
is if we have

$$|Y_j \cdot x|^2 \le 8C^2\varepsilon^2$$

for at least $n/2$ values of $j$. We do not know in advance what the set
of $j$ is for which this event holds. But the number of possible values
of such sets of $j$ is at most $2^n$. Applying the union bound (and paying
the entropy cost of $2^n$) and using symmetry, we may thus bound the
above probability by^{54}

$$\le 2^n\,\mathbf{P}(|Y_j \cdot x|^2 \le 8C^2\varepsilon^2 \text{ for } 1 \le j \le n/2).$$



^{54} We will take $n$ to be even for sake of notation, although it makes
little essential difference.

Now observe that the random variables $Y_j \cdot x$ are independent, and
so we can bound this expression by

$$\le 2^n\,\mathbf{P}(|Y \cdot x| \le \sqrt{8}C\varepsilon)^{n/2}$$

where $Y = (\xi_1, \ldots, \xi_n)$ is a random vector of iid Bernoulli signs.

We write $x = (x_1, \ldots, x_n)$, so that $Y \cdot x$ is a random walk

$$Y \cdot x = \xi_1 x_1 + \cdots + \xi_n x_n.$$

To understand this walk, we apply (a slight variant of) the
Berry-Esseen theorem from Section 2.2:

Exercise 2.7.1. Show^{55} that

$$\sup_t \mathbf{P}(|Y \cdot x - t| \le r) \ll \frac{r}{\|x\|} + \frac{1}{\|x\|^3}\sum_{j=1}^n |x_j|^3$$

for any $r > 0$ and any non-zero $x$. (Hint: first normalise $\|x\| = 1$,
then adapt the proof of the Berry-Esseen theorem.)

Conclude in particular that if

$$\sum_{j : |x_j| \le \varepsilon^{100}} |x_j|^2 \ge \varepsilon^{10}$$

(say) then

$$\sup_t \mathbf{P}(|Y \cdot x - t| \le \sqrt{8}C\varepsilon) \ll \varepsilon.$$

(Hint: condition out all the $x_j$ with $|x_j| > 1/2$.)
Let us temporarily call $x$ incompressible if

$$\sum_{j : |x_j| \le \varepsilon^{100}} |x_j|^2 \ge \varepsilon^{10}$$

and compressible otherwise. If we only look at the incompressible
elements of $\Sigma$, we can now bound

$$\mathbf{P}(\|Mx\| \le 2C\varepsilon\sqrt{n}) \le O(\varepsilon)^n,$$

and comparing this against the entropy cost (2.171) we obtain an
acceptable contribution for $\varepsilon$ small enough (here we are crucially using
the rectangular condition $p \le (1 - \delta)n$).

^{55} Actually, for the purposes of this section, it would suffice to
establish a weaker form of the Berry-Esseen theorem with
$\sum_{j=1}^n |x_j|^3/\|x\|^3$ replaced by $(\sum_{j=1}^n |x_j|^3/\|x\|^3)^c$ for any fixed $c > 0$.

It remains to deal with the compressible vectors. Observe that
such vectors lie within $\varepsilon$ of a sparse unit vector which is only
supported in at most $\varepsilon^{-200}$ positions. The $\varepsilon$-entropy of these sparse
vectors (i.e. the number of balls of radius $\varepsilon$ needed to cover this
space) can easily be computed to be of polynomial size $O(n^{O_\varepsilon(1)})$ in
$n$. Meanwhile, we have the following crude bound:

Exercise 2.7.2. For any unit vector $x$, show that

$$\mathbf{P}(|Y \cdot x| \le \kappa) \le 1 - \kappa$$

for $\kappa > 0$ small enough. (Hint: Use the Paley-Zygmund inequality,
Exercise 1.1.9. Bounds on higher moments of $|Y \cdot x|$ can be obtained
for instance using Hoeffding's inequality, or by direct computation.)
Use this to show that

$$\mathbf{P}(\|Mx\| \le 2C\varepsilon\sqrt{n}) \ll \exp(-cn)$$

for all such $x$ and $\varepsilon$ sufficiently small, with $c > 0$ independent of $\varepsilon$
and $n$.

Thus the compressible vectors give a net contribution of
$O(n^{O_\varepsilon(1)}) \times \exp(-cn)$, which is acceptable. This concludes the proof
of Theorem 2.7.1.

2.7.2. Singularity probability. Now we turn to square Bernoulli
matrices $M = (\xi_{ij})_{1 \le i,j \le n}$. Before we investigate the size of the least
singular value, we first tackle the easier problem of bounding the
singularity probability

$$\mathbf{P}(\sigma_n(M) = 0),$$

i.e. the probability that $M$ is not invertible. The problem of
computing this probability exactly is still not completely settled.
Since $M$ is singular whenever the first two rows (say) are identical,
we obtain a lower bound

$$\mathbf{P}(\sigma_n(M) = 0) \ge \frac{1}{2^n},$$

and it is conjectured that this bound is essentially tight in the sense
that

$$\mathbf{P}(\sigma_n(M) = 0) = \left(\frac{1}{2} + o(1)\right)^n,$$

but this remains open; the best bound currently is [BoVuWo2010],
and gives

$$\mathbf{P}(\sigma_n(M) = 0) \le \left(\frac{1}{\sqrt{2}} + o(1)\right)^n.$$

We will not prove this bound here, but content ourselves with a weaker
bound, essentially due to Komlós [Ko1967]:

Proposition 2.7.2. We have $\mathbf{P}(\sigma_n(M) = 0) \ll 1/n^{1/2}$.
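
For very small $n$ the singularity probability can be computed exactly by enumerating all $2^{n^2}$ sign matrices. The following brute-force sketch does this with exact integer determinants; the function names are our own, and the approach is only feasible for $n \le 4$ or so.

```python
from fractions import Fraction
from itertools import product


def det(rows):
    """Integer determinant by Laplace expansion along the first row
    (adequate for the tiny matrices enumerated here)."""
    n = len(rows)
    if n == 1:
        return rows[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in rows[1:]]
        total += (-1) ** j * rows[0][j] * det(minor)
    return total


def singularity_probability(n):
    """Exact P(sigma_n(M) = 0) for an n x n matrix of iid random signs."""
    singular = 0
    for entries in product((-1, 1), repeat=n * n):
        rows = [entries[i * n:(i + 1) * n] for i in range(n)]
        if det(rows) == 0:
            singular += 1
    return Fraction(singular, 2 ** (n * n))


for n in (1, 2, 3):
    p = singularity_probability(n)
    # compare with the trivial lower bound 2^{-n} from two identical rows
    print(n, p, p >= Fraction(1, 2 ** n))
```

For $n = 2$ the matrix is singular exactly when the two rows are equal or opposite, which is why the exact value exceeds the lower bound $2^{-n}$ quoted above.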

To show this, we need the following combinatorial fact, due to
Erdős [Er1945]:

Proposition 2.7.3 (Erdős Littlewood-Offord theorem). Let
$x = (x_1, \ldots, x_n)$ be a vector with at least $k$ nonzero entries, and let
$Y = (\xi_1, \ldots, \xi_n)$ be a random vector of iid Bernoulli signs. Then
$\mathbf{P}(Y \cdot x = 0) \ll k^{-1/2}$.

Proof. By taking real and imaginary parts we may assume that $x$
is real. By eliminating zero coefficients of $x$ we may assume that
$k = n$; reflecting we may then assume that all the $x_i$ are positive.
Observe that the set of $Y = (\xi_1, \ldots, \xi_n) \in \{-1,1\}^n$ with $Y \cdot x = 0$
forms an antichain^{56} in $\{-1,1\}^n$ with the product partial ordering.
The claim now easily follows from Sperner's theorem and Stirling's
formula (Section 1.2). □

Note that we also have the obvious bound

(2.172)  $$\mathbf{P}(Y \cdot x = 0) \le 1/2$$

for any non-zero $x$.
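
To get a feel for Proposition 2.7.3, one can enumerate all sign patterns for a short integer vector $x$ and compare $\mathbf{P}(Y \cdot x = 0)$ with the explicit Sperner bound $\binom{k}{\lfloor k/2 \rfloor}/2^k$ that the proof yields for a real vector with $k$ nonzero entries. This is a small illustrative sketch; the helper names and the sample vectors are our own.

```python
from itertools import product
from math import comb


def zero_probability(x):
    """Exact P(Y . x = 0) for iid random signs Y, by enumerating all 2^n patterns."""
    n = len(x)
    hits = sum(
        1
        for signs in product((-1, 1), repeat=n)
        if sum(s * xi for s, xi in zip(signs, x)) == 0
    )
    return hits / 2 ** n


def sperner_bound(k):
    """Largest antichain in {-1,1}^k divided by 2^k."""
    return comb(k, k // 2) / 2 ** k


examples = [[1] * 10, [1, 2, 3, 4, 5, 6, 7, 8], [3, 0, 5, 0, 2]]
for x in examples:
    k = sum(1 for xi in x if xi != 0)  # number of nonzero entries
    print(x, zero_probability(x), sperner_bound(k))
```

The all-ones vector attains the Sperner bound exactly when $k$ is even, while vectors with more arithmetic spread tend to have a much smaller concentration probability; this is the phenomenon studied in the inverse Littlewood-Offord theory.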

Now we prove Proposition 2.7.2. In analogy with the arguments of
Section 2.7.1, we write

$$\mathbf{P}(\sigma_n(M) = 0) = \mathbf{P}(Mx = 0 \text{ for some nonzero } x \in \mathbf{C}^n)$$

(actually we can take $x \in \mathbf{R}^n$ since $M$ is real). We divide into
compressible and incompressible vectors as before, but our definition of
compressibility and incompressibility is slightly different now. Also,
one has to do a certain amount of technical maneuvering in order to
preserve the crucial independence between rows and columns.

^{56} An antichain in a partially ordered set $X$ is a subset $S$ of $X$ such that
no two elements in $S$ are comparable in the order. The product partial
ordering on $\{-1,1\}^n$ is defined by requiring $(x_1, \ldots, x_n) \le (y_1, \ldots, y_n)$
iff $x_i \le y_i$ for all $i$. Sperner's theorem asserts that all anti-chains in
$\{-1,1\}^n$ have cardinality at most $\binom{n}{\lfloor n/2 \rfloor}$.

Namely, we pick an $\varepsilon > 0$ and call $x$ compressible if it is supported
on at most $\varepsilon n$ coordinates, and incompressible otherwise.

Let us first consider the contribution of the event that $Mx = 0$
for some nonzero compressible $x$. Pick an $x$ with this property which
is as sparse as possible, say $k$-sparse for some $1 \le k \le \varepsilon n$. Let us
temporarily fix $k$. By paying an entropy cost of $\binom{n}{k}$, we may
assume that it is the first $k$ entries that are non-zero for some
$1 \le k \le \varepsilon n$. This implies that the first $k$ columns $Y_1, \ldots, Y_k$ of $M$ have a
linear dependence given by $x$; by minimality, $Y_1, \ldots, Y_{k-1}$ are linearly
independent. Thus, $x$ is uniquely determined (up to scalar multiples)
by $Y_1, \ldots, Y_k$. Furthermore, as the $n \times k$ matrix formed by $Y_1, \ldots, Y_k$
has rank $k - 1$, there is some $k \times k$ minor which already determines $x$ up
to constants; by paying another entropy cost of $\binom{n}{k}$, we may assume
that it is the top left minor which does this. In particular, we can
now use the first $k$ rows $X_1, \ldots, X_k$ to determine $x$ up to constants.
But the remaining $n - k$ rows are independent of $X_1, \ldots, X_k$ and still
need to be orthogonal to $x$; by Proposition 2.7.3, this happens with
probability at most $O(\sqrt{k})^{-(n-k)}$, giving a total cost of

$$\sum_{1 \le k \le \varepsilon n}\binom{n}{k}^2 O(\sqrt{k})^{-(n-k)},$$

which by Stirling's formula (Section 1.2) is acceptable (in fact this
gives an exponentially small contribution).

The same argument gives that the event that $y^*M = 0$ for some
nonzero compressible $y$ also has exponentially small probability. The
only remaining event to control is the event that $Mx = 0$ for some
incompressible $x$, but that $Mz \neq 0$ and $y^*M \neq 0$ for all nonzero
compressible $z, y$. Call this event $E$.

Since $Mx = 0$ for some incompressible $x$, we see that for at least
$\varepsilon n$ values of $k \in \{1, \ldots, n\}$, the row $X_k$ lies in the vector space $V_k$
spanned by the remaining $n - 1$ rows of $M$. Let $E_k$ denote the event
that $E$ holds, and that $X_k$ lies in $V_k$; then we see from double counting
that

$$\mathbf{P}(E) \le \frac{1}{\varepsilon n}\sum_{k=1}^n \mathbf{P}(E_k).$$

By symmetry, we thus have

$$\mathbf{P}(E) \le \frac{1}{\varepsilon}\mathbf{P}(E_n).$$

To compute $\mathbf{P}(E_n)$, we freeze $X_1, \ldots, X_{n-1}$ and consider a normal vector
$x$ to $V_n$; note that we can select $x$ depending only on $X_1, \ldots, X_{n-1}$.
We may assume that an incompressible normal vector exists, since
otherwise the event $E_n$ would be empty. We make the crucial
observation that $X_n$ is still independent of $x$. By Proposition 2.7.3, we
thus see that the conditional probability that $X_n \cdot x = 0$, for fixed
$X_1, \ldots, X_{n-1}$, is $O_\varepsilon(n^{-1/2})$. We thus see that $\mathbf{P}(E) \ll_\varepsilon 1/n^{1/2}$, and
the claim follows.

Remark 2.7.4. Further progress has been made on this problem
by a finer analysis of the concentration probability $\mathbf{P}(Y \cdot x = 0)$,
and in particular in classifying those $x$ for which this concentration
probability is large (this is known as the inverse Littlewood-Offord
problem). Important breakthroughs in this direction were
made by Halász [Ha1977] (introducing Fourier-analytic tools) and
by Kahn, Komlós, and Szemerédi [KaKoSz1995] (introducing an
efficient "swapping" argument). In [TaVu2007] tools from additive
combinatorics (such as Freiman's theorem) were introduced to obtain
further improvements, leading eventually to the results from
[BoVuWo2010] mentioned earlier.

2.7.3. Lower bound for the least singular value. Now we return
to the least singular value $\sigma_n(M)$ of an iid Bernoulli matrix,
and establish a lower bound. Given that there are $n$ singular values
between $0$ and $\sigma_1(M)$, which is typically of size $O(\sqrt{n})$, one expects
the least singular value to be of size about $1/\sqrt{n}$ on the average.
Another argument supporting this heuristic comes from the following
identity:

Exercise 2.7.3 (Negative second moment identity). Let $M$ be an
invertible $n \times n$ matrix, let $X_1, \ldots, X_n$ be the rows of $M$, and let
$R_1, \ldots, R_n$ be the columns of $M^{-1}$. For each $1 \le i \le n$, let $V_i$ be the
hyperplane spanned by all the rows $X_1, \ldots, X_n$ other than $X_i$. Show
that $\|R_i\| = \mathrm{dist}(X_i, V_i)^{-1}$ and $\sum_{i=1}^n \sigma_i(M)^{-2} = \sum_{i=1}^n \mathrm{dist}(X_i, V_i)^{-2}$.
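
The identity in Exercise 2.7.3 can be checked in exact rational arithmetic on a small example. The sketch below is only an illustration for real matrices: it computes $\|R_i\|^2$ from the inverse, $\mathrm{dist}(X_i, V_i)^2$ by Gram-Schmidt, and $\sum_i \sigma_i(M)^{-2}$ as the trace of $(M^T M)^{-1}$. The function names and the sample matrix are our own, and the input is assumed to be invertible.

```python
from fractions import Fraction


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def inverse(matrix):
    """Gauss-Jordan inverse over the rationals (matrix assumed invertible)."""
    n = len(matrix)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def dist_squared(target, others):
    """Squared distance from `target` to span(others), via exact Gram-Schmidt
    (the vectors in `others` are assumed linearly independent)."""
    basis = []
    for vec in others:
        w = list(vec)
        for b in basis:
            coef = dot(w, b) / dot(b, b)
            w = [wi - coef * bi for wi, bi in zip(w, b)]
        basis.append(w)
    r = list(target)
    for b in basis:
        coef = dot(r, b) / dot(b, b)
        r = [ri - coef * bi for ri, bi in zip(r, b)]
    return dot(r, r)


def check_identity(matrix):
    n = len(matrix)
    rows = [[Fraction(x) for x in row] for row in matrix]
    inv = inverse(rows)
    # squared norms of the columns R_i of M^{-1}
    col_norms = [sum(inv[r][i] ** 2 for r in range(n)) for i in range(n)]
    dists = [dist_squared(rows[i], rows[:i] + rows[i + 1:]) for i in range(n)]
    # M^T M has eigenvalues sigma_i(M)^2, so tr (M^T M)^{-1} = sum sigma_i^{-2}
    cols = [[rows[k][i] for k in range(n)] for i in range(n)]
    gram = [[dot(cols[i], cols[j]) for j in range(n)] for i in range(n)]
    gram_inv = inverse(gram)
    trace = sum(gram_inv[i][i] for i in range(n))
    return col_norms, dists, trace


col_norms, dists, trace = check_identity([[1, 1, 1], [1, -1, 1], [1, 1, -1]])
print(col_norms, dists, trace)
```

For this sample sign matrix every product $\|R_i\|^2 \, \mathrm{dist}(X_i, V_i)^2$ equals 1, and the sum of the $\mathrm{dist}(X_i, V_i)^{-2}$ agrees with the trace, as the exercise predicts.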

From Talagrand's inequality (Theorem 2.1.13), we expect each
$\mathrm{dist}(X_i, V_i)$ to be of size $O(1)$ on the average, which suggests that
$\sum_{i=1}^n \sigma_i(M)^{-2} = O(n)$; this is consistent with the heuristic that the
singular values $\sigma_i(M)$ should be roughly evenly spaced in the interval
$[0, 2\sqrt{n}]$ (so that $\sigma_{n-i}(M)$ should be about $(i+1)/\sqrt{n}$).

Now we give a rigorous lower bound: 

Theorem 2.7.5 (Lower tail estimate for the least singular value).
For any $\lambda > 0$, one has

$$\mathbf{P}(\sigma_n(M) \le \lambda/\sqrt{n}) \ll o_{\lambda \to 0}(1) + o_{n \to \infty; \lambda}(1)$$

where $o_{\lambda \to 0}(1)$ goes to zero as $\lambda \to 0$ uniformly in $n$, and
$o_{n \to \infty; \lambda}(1)$ goes to zero as $n \to \infty$ for each fixed $\lambda$.

This is a weaker form of a result of Rudelson and Vershynin [RuVe2008]
(which obtains a bound of the form $O(\lambda) + O(c^n)$ for some $c < 1$),
which builds upon the earlier works [Ru2008], [TaVu2009], which
obtained variants of the above result.
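
The lower tail in Theorem 2.7.5 can be explored empirically. The following Monte Carlo sketch estimates the frequency of the event $\sigma_n(M) \le \lambda/\sqrt{n}$ for random sign matrices, computing $\sigma_n(M)$ as the square root of the smallest eigenvalue of $M^T M$ by cyclic Jacobi rotations. Everything here (the function names, the matrix size, the number of trials, and the use of a Jacobi solver rather than a library SVD) is an illustrative assumption, and the output is a simulation, not a substitute for the theorem.

```python
import math
import random


def symmetric_eigenvalues(a, sweeps=30, tol=1e-20):
    """Eigenvalues of a real symmetric matrix (list of lists) by cyclic Jacobi rotations."""
    n = len(a)
    a = [row[:] for row in a]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-300:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


def least_singular_value(matrix):
    """sigma_n(M) for a square real matrix, via the smallest eigenvalue of M^T M."""
    n = len(matrix)
    gram = [[sum(matrix[k][i] * matrix[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
    return math.sqrt(max(symmetric_eigenvalues(gram)[0], 0.0))


def bernoulli_matrix(n, rng):
    return [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(n)]


def lower_tail_frequency(n, lam, trials, seed=0):
    """Empirical frequency of the event sigma_n(M) <= lam / sqrt(n)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if least_singular_value(bernoulli_matrix(n, rng)) <= lam / math.sqrt(n):
            hits += 1
    return hits / trials


for lam in (0.1, 0.5, 1.0, 2.0):
    print(lam, lower_tail_frequency(12, lam, 100))
```

Because squaring the matrix loses precision, this approach is only reliable when $\sigma_n(M)$ is not extremely small; exactly singular sign matrices are reported with a least singular value at the level of rounding error.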

The scale $1/\sqrt{n}$ that we are working at here is too fine to use
epsilon net arguments (unless one has a lot of control on the
entropy, which can be obtained in some cases thanks to powerful inverse
Littlewood-Offord theorems, but is difficult to obtain in general). We
can prove this theorem along similar lines to the arguments in the
previous section; we sketch the method as follows. We can take $\lambda$ to
be small. We write the probability to be estimated as

$$\mathbf{P}(\|Mx\| \le \lambda/\sqrt{n} \text{ for some unit vector } x \in \mathbf{C}^n).$$

We can assume that $\|M\|_{op} \le C\sqrt{n}$ for some absolute constant $C$, as
the event that this fails has exponentially small probability.

We pick an $\varepsilon > 0$ (not depending on $\lambda$) to be chosen later. We call
a unit vector $x \in \mathbf{C}^n$ compressible if $x$ lies within a distance $\varepsilon$ of an
$\varepsilon n$-sparse vector. Let us first dispose of the case in which
$\|Mx\| \le \lambda/\sqrt{n}$ for some compressible $x$. By paying an entropy cost of
$\binom{n}{\lfloor \varepsilon n \rfloor}$, we may assume that $x$ is within $\varepsilon$ of a vector $y$ supported
in the first $\lfloor \varepsilon n \rfloor$ coordinates. Using the operator norm bound on $M$
and the triangle inequality, we conclude that

$$\|My\| \le (\lambda + C\varepsilon)\sqrt{n}.$$

Since $y$ has norm comparable to 1, this implies that the least singular
value of the first $\lfloor \varepsilon n \rfloor$ columns of $M$ is $O((\lambda + \varepsilon)\sqrt{n})$. But by
Theorem 2.7.1, this occurs with probability $O(\exp(-cn))$ (if $\lambda, \varepsilon$ are
small enough). So the total probability of the compressible event is at
most $\binom{n}{\lfloor \varepsilon n \rfloor}O(\exp(-cn))$, which is acceptable if $\varepsilon$ is small enough.

Thus we may assume now that $\|Mx\| > \lambda/\sqrt{n}$ for all compressible
unit vectors $x$; we may similarly assume that $\|y^*M\| > \lambda/\sqrt{n}$ for
all compressible unit vectors $y$. Indeed, we may also assume that
$\|y^*M_i\| > \lambda/\sqrt{n}$ for every $i$, where $M_i$ is $M$ with the $i$th column
removed.

The remaining case is if $\|Mx\| \le \lambda/\sqrt{n}$ for some incompressible
$x$. Let us call this event $E$. Write $x = (x_1, \ldots, x_n)$, and let $Y_1, \ldots, Y_n$
be the columns of $M$, thus

$$\|x_1 Y_1 + \cdots + x_n Y_n\| \le \lambda/\sqrt{n}.$$

Letting $W_i$ be the subspace spanned by all the $Y_1, \ldots, Y_n$ except for
$Y_i$, we conclude upon projecting to the orthogonal complement of $W_i$
that

$$|x_i|\,\mathrm{dist}(Y_i, W_i) \le \lambda/\sqrt{n}$$

for all $i$ (compare with Exercise 2.7.3). On the other hand, since $x$
is incompressible, we see that $|x_i| \ge \varepsilon/\sqrt{n}$ for at least $\varepsilon n$ values of $i$,
and thus

(2.173)  $$\mathrm{dist}(Y_i, W_i) \le \lambda/\varepsilon$$

for at least $\varepsilon n$ values of $i$. If we let $E_i$ be the event that $E$ and (2.173)
both hold, we thus have from double-counting that

$$\mathbf{P}(E) \le \frac{1}{\varepsilon n}\sum_{i=1}^n \mathbf{P}(E_i)$$

and thus by symmetry

$$\mathbf{P}(E) \le \frac{1}{\varepsilon}\mathbf{P}(E_n)$$

(say). However, if $E_n$ holds, then setting $y$ to be a unit normal vector
to $W_n$ (which is necessarily incompressible, by the hypothesis on $M_n$),
we have

$$|Y_n \cdot y| \le \lambda/\varepsilon.$$

Again, the crucial point is that $Y_n$ and $y$ are independent. The
incompressibility of $y$, combined with a Berry-Esseen type theorem, then
gives

Exercise 2.7.4. Show that

$$\mathbf{P}(|Y_n \cdot y| \le \lambda/\varepsilon) \ll \varepsilon^2$$

(say) if $\lambda$ is sufficiently small depending on $\varepsilon$, and $n$ is sufficiently
large depending on $\varepsilon$.

This gives a bound of $O(\varepsilon)$ for $\mathbf{P}(E)$ if $\lambda$ is small enough depending
on $\varepsilon$, and $n$ is large enough; this gives the claim.

Remark 2.7.6. A variant of these arguments, based on inverse
Littlewood-Offord theorems rather than the Berry-Esseen theorem, gives
the variant estimate

(2.174)  $$\sigma_n\left(\frac{1}{\sqrt{n}}M_n - zI\right) \ge n^{-A}$$

with high probability for some $A > 0$, and any $z$ of polynomial size
in $n$. There are several results of this type, with overlapping ranges of
generality (and various values of $A$) [GoTi2007, PaZh2010, TaVu2008],
and the exponent $A$ is known to degrade if one has too few moment
assumptions on the underlying random matrix $M$. This type of result
(with an unspecified $A$) is important for the circular law, discussed
in Section 2.8.

2.7.4. Upper bound for the least singular value. One can com- 
plement the lower tail estimate with an upper tail estimate: 

Theorem 2.7.7 (Upper tail estimate for the least singular value).
For any $\lambda > 0$, one has

(2.175)  $$\mathbf{P}(\sigma_n(M) \ge \lambda/\sqrt{n}) \ll o_{\lambda \to \infty}(1) + o_{n \to \infty; \lambda}(1).$$

We prove this using an argument of Rudelson and Vershynin [RuVe2009].
Suppose that $\sigma_n(M) > \lambda/\sqrt{n}$, then

(2.176)  $$\|yM^{-1}\| \le \sqrt{n}\|y\|/\lambda$$

for all $y$.

Next, let $X_1, \ldots, X_n$ be the rows of $M$, and let $R_1, \ldots, R_n$ be
the columns of $M^{-1}$, thus $R_1, \ldots, R_n$ is a dual basis for $X_1, \ldots, X_n$.
From (2.176) we have

$$\sum_{i=1}^n |y \cdot R_i|^2 \le n\|y\|^2/\lambda^2.$$

We apply this with $y$ equal to $X_n - \pi_n(X_n)$, where $\pi_n$ is the orthogonal
projection to the space $V_{n-1}$ spanned by $X_1, \ldots, X_{n-1}$. On the one
hand, we have

$$\|y\|^2 = \mathrm{dist}(X_n, V_{n-1})^2$$

and on the other hand we have for any $1 \le i < n$ that

$$y \cdot R_i = -\pi_n(X_n) \cdot R_i = -X_n \cdot \pi_n(R_i)$$

and so

(2.177)  $$\sum_{i=1}^{n-1} |X_n \cdot \pi_n(R_i)|^2 \le n\,\mathrm{dist}(X_n, V_{n-1})^2/\lambda^2.$$

If (2.177) holds, then $|X_n \cdot \pi_n(R_i)|^2 = O(\mathrm{dist}(X_n, V_{n-1})^2/\lambda^2)$ for at
least half of the $i$, so the probability in (2.175) can be bounded by

$$\ll \frac{1}{n}\sum_{i=1}^{n-1}\mathbf{P}(|X_n \cdot \pi_n(R_i)|^2 = O(\mathrm{dist}(X_n, V_{n-1})^2/\lambda^2))$$

which by symmetry can be bounded by

$$\ll \mathbf{P}(|X_n \cdot \pi_n(R_1)|^2 = O(\mathrm{dist}(X_n, V_{n-1})^2/\lambda^2)).$$

Let $\varepsilon > 0$ be a small quantity to be chosen later. From Talagrand's
inequality (Theorem 2.1.13) we know that $\mathrm{dist}(X_n, V_{n-1}) = O_\varepsilon(1)$
with probability $1 - O(\varepsilon)$, so we obtain a bound of

$$\ll \mathbf{P}(X_n \cdot \pi_n(R_1) = O_\varepsilon(1/\lambda)) + O(\varepsilon).$$

Now a key point is that the vectors $\pi_n(R_1), \ldots, \pi_n(R_{n-1})$ depend
only on $X_1, \ldots, X_{n-1}$ and not on $X_n$; indeed, they are the dual basis
for $X_1, \ldots, X_{n-1}$ in $V_{n-1}$. Thus, after conditioning $X_1, \ldots, X_{n-1}$
and thus $\pi_n(R_1)$ to be fixed, $X_n$ is still a Bernoulli random vector.
Applying a Berry-Esseen inequality, we obtain a bound of $O(\varepsilon)$ for the
conditional probability that $X_n \cdot \pi_n(R_1) = O_\varepsilon(1/\lambda)$ for $\lambda$ sufficiently
large depending on $\varepsilon$, unless $\pi_n(R_1)$ is compressible (in the sense that,
say, it is within $\varepsilon$ of an $\varepsilon n$-sparse vector). But this latter possibility
can be controlled (with exponentially small probability) by the same
type of arguments as before; we omit the details.

2.7.5. Asymptotic for the least singular value. The distribution
of singular values of a Gaussian random matrix can be computed
explicitly by techniques similar to those employed in Section 2.6. In
particular, if $M$ is a real Gaussian matrix (with all entries iid with
distribution $N(0,1)_{\mathbf{R}}$), it was shown in [Ed1988] that $\sqrt{n}\sigma_n(M)$
converges in distribution to the distribution
$\mu_E := \frac{1 + \sqrt{x}}{2\sqrt{x}}e^{-x/2 - \sqrt{x}}\,dx$ as $n \to \infty$. It turns out that this
result can be extended to other ensembles with the same mean and
variance. In particular, we have the following result from [TaVu2010]:

Theorem 2.7.8. If $M$ is an iid Bernoulli matrix, then $\sqrt{n}\sigma_n(M)$
also converges in distribution to $\mu_E$ as $n \to \infty$. (In fact there is a
polynomial rate of convergence.)

This should be compared with Theorems 2.7.5, 2.7.7, which show 
that y/na n (M) have a tight sequence of distributions in (0, +co). The 
arguments from [TaVu2010] thus provide an alternate proof of these 
two theorems. 

The arguments in [TaVu2010] do not establish the limit he di- 
rectly, but instead use the result of [Edl988] as a black box, focusing 
instead on establishing the universality of the limiting distribution of 
y/na n (M), and in particular that this limiting distribution is the same 
whether one has a Bernoulli ensemble or a Gaussian ensemble. 

The arguments are somewhat technical and we will not present 
them in full here, but instead give a sketch of the key ideas. 

In previous sections we have already seen the close relationship
between the least singular value $\sigma_n(M)$, and the distances $\operatorname{dist}(X_i, V_i)$
between a row $X_i$ of $M$ and the hyperplane $V_i$ spanned by the other
$n-1$ rows. It is not hard to use the above machinery to show that as
$n \to \infty$, $\operatorname{dist}(X_i, V_i)$ converges in distribution to the absolute value
$|N(0,1)_{\mathbf{R}}|$ of a Gaussian regardless of the underlying distribution of
the coefficients of $M$ (i.e. it is asymptotically universal). The basic
point is that one can write $\operatorname{dist}(X_i, V_i)$ as $|X_i \cdot n_i|$ where $n_i$ is
a unit normal of $V_i$ (we will assume here that $M$ is non-singular,
which by previous arguments is true asymptotically almost surely).
The previous machinery lets us show that $n_i$ is incompressible with
high probability, and the claim then follows from the Berry-Esséen
theorem.

Unfortunately, despite the presence of suggestive relationships
such as Exercise 2.7.3, the asymptotic universality of the distances
$\operatorname{dist}(X_i, V_i)$ does not directly imply asymptotic universality of the
least singular value. However, it turns out that one can obtain a
higher-dimensional version of the universality of the scalar quantities
$\operatorname{dist}(X_i, V_i)$, as follows. For any small $k$ (say, $1 \leq k \leq n^c$ for some
small $c > 0$) and any distinct $i_1,\ldots,i_k \in \{1,\ldots,n\}$, a modification
of the above argument shows that the covariance matrix

$$\left( \pi(X_{i_a}) \cdot \pi(X_{i_b}) \right)_{1 \leq a,b \leq k} \tag{2.178}$$

of the orthogonal projections $\pi(X_{i_1}),\ldots,\pi(X_{i_k})$ of the $k$ rows
$X_{i_1},\ldots,X_{i_k}$ to the complement $V_{i_1,\ldots,i_k}^\perp$ of the space $V_{i_1,\ldots,i_k}$
spanned by the other $n-k$ rows of $M$, is also universal, converging in
distribution to the covariance matrix^{57} $(G_a \cdot G_b)_{1 \leq a,b \leq k}$ of $k$ iid
Gaussians $G_a \equiv N(0,1)_{\mathbf{R}}$ (note that the convergence of $\operatorname{dist}(X_i, V_i)$
to $|N(0,1)_{\mathbf{R}}|$ is the $k=1$ case of this claim). The key point is that one
can show that the complement $V_{i_1,\ldots,i_k}^\perp$ is usually "incompressible" in
a certain technical sense, which implies that the projections $\pi(X_{i_a})$
behave like iid Gaussians on that projection thanks to a
multidimensional Berry-Esséen theorem.

On the other hand, the covariance matrix (2.178) is closely related
to the inverse matrix $M^{-1}$:

Exercise 2.7.5. Show that (2.178) is also equal to $(A^*A)^{-1}$, where $A$
is the $n \times k$ matrix formed from the $i_1,\ldots,i_k$ columns of $M^{-1}$.
(When $k=1$ this is the identity $\|R_i\| = 1/\operatorname{dist}(X_i,V_i)$ of
Exercise 2.7.3.)

In particular, this shows that the singular values of $k$ randomly
selected columns of $M^{-1}$ have a universal distribution.
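The relationship in Exercise 2.7.5 can be sanity-checked numerically. The following is only a sketch, under the assumptions that NumPy is available and that a real Gaussian matrix is an acceptable stand-in for a generic invertible $M$; the helper names are ours and purely illustrative. In this check the Gram matrix (2.178) of the projected rows and the Gram matrix $A^*A$ of the selected columns of $M^{-1}$ multiply to the identity, which is the $k \times k$ analogue of $\|R_i\| = 1/\operatorname{dist}(X_i,V_i)$.

```python
import numpy as np

def gram_of_projected_rows(M, idx):
    """Gram matrix (pi(X_a) . pi(X_b)) of the rows M[idx], after projecting them
    onto the orthogonal complement of the span of the remaining rows."""
    n = M.shape[0]
    others = [j for j in range(n) if j not in idx]
    # Columns of Q: an orthonormal basis of the span of the other n-k rows.
    Q, _ = np.linalg.qr(M[others].T)
    X = M[idx]                       # the k selected rows, shape (k, n)
    P = X - (X @ Q) @ Q.T            # remove the components along the other rows
    return P @ P.T

def gram_of_inverse_columns(M, idx):
    """A^T A, where A is the n x k matrix of the chosen columns of M^{-1}."""
    A = np.linalg.inv(M)[:, idx]
    return A.T @ A

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))      # illustrative choice of matrix
idx = [1, 4, 6]                      # illustrative choice of rows/columns
G = gram_of_projected_rows(M, idx)
H = gram_of_inverse_columns(M, idx)
print(np.round(G @ H, 8))            # numerically the 3 x 3 identity
```

The projection is computed from a QR factorisation rather than from explicit normal vectors, which keeps the sketch short; any orthonormal basis of the span of the other rows would serve.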



^{57} These covariance matrix distributions are also known as Wishart distributions.

Recall our goal is to show that $\sqrt{n}\sigma_n(M)$ has an asymptotically
universal distribution, which is equivalent to asking that $\frac{1}{\sqrt{n}}\|M^{-1}\|_{op}$
has an asymptotically universal distribution. The goal is then to
extract the operator norm of $M^{-1}$ from looking at a random $n \times k$
minor $B$ of this matrix. This comes from the following application of
the second moment method:

Exercise 2.7.6. Let $A$ be an $n \times n$ matrix with columns $R_1,\ldots,R_n$,
and let $B$ be the $n \times k$ matrix formed by taking $k$ of the columns
$R_1,\ldots,R_n$ at random. Show that

$$\mathbf{E} \left\| AA^* - \frac{n}{k} BB^* \right\|_F^2 \leq \frac{n}{k} \sum_{j=1}^n \|R_j\|^4,$$

where $\|\cdot\|_F$ is the Frobenius norm (2.64).

Recall from Exercise 2.7.3 that $\|R_k\| = 1/\operatorname{dist}(X_k, V_k)$, so we
expect each $\|R_k\|$ to have magnitude about $O(1)$. This, together
with the Wielandt-Hoffman inequality (1.68) means that we expect
$\sigma_1((M^{-1})^*(M^{-1})) = \sigma_n(M)^{-2}$ to differ by $O(n^2/k)$ from
$\frac{n}{k}\sigma_1(B^*B) = \frac{n}{k}\sigma_1(B)^2$ (recall that $AA^*$ and $A^*A$ have the same
non-zero eigenvalues, and similarly for $B$). In principle, this gives us
asymptotic universality on $\sqrt{n}\sigma_n(M)$
from the already established universality of $B$.
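For small $n$ the expectation in Exercise 2.7.6 can be computed exactly by enumerating all choices of columns. The sketch below makes two assumptions that are ours rather than the text's: the $k$ columns are drawn as a uniformly random $k$-element subset (without replacement), and the estimate is read with the $n \times n$ matrices $AA^*$ and $BB^*$ so that both terms have the same shape. The function name is hypothetical and NumPy is assumed.

```python
import itertools
import numpy as np

def second_moment_sides(A, k):
    """Return (lhs, rhs) where
    lhs = E || A A^T - (n/k) B B^T ||_F^2 over a uniformly random k-subset of columns,
    rhs = (n/k) * sum_j ||R_j||^4, with R_j the columns of A."""
    n = A.shape[1]
    full = A @ A.T
    errors = []
    for subset in itertools.combinations(range(n), k):
        B = A[:, list(subset)]
        errors.append(np.linalg.norm(full - (n / k) * (B @ B.T), "fro") ** 2)
    lhs = float(np.mean(errors))     # exact expectation: every subset is equally likely
    rhs = (n / k) * float(np.sum(np.linalg.norm(A, axis=0) ** 4))
    return lhs, rhs

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 7))      # illustrative real matrix
lhs, rhs = second_moment_sides(A, 3)
print(lhs, rhs, lhs <= rhs)
```

Enumeration is only feasible for small $n$ and $k$; for larger sizes one would replace the loop by Monte Carlo sampling, at the cost of an exact comparison.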

There is one technical obstacle remaining, however: while we
know that each $\operatorname{dist}(X_k, V_k)$ is distributed like a Gaussian, so that
each individual $R_k$ is going to be of size $O(1)$ with reasonably good
probability, in order for the above exercise to be useful, one needs to
bound all of the $R_k$ simultaneously with high probability. A naive
application of the union bound leads to terrible results here.
Fortunately, there is a strong correlation between the $R_k$: they tend to be
large together or small together, or equivalently the distances
$\operatorname{dist}(X_k, V_k)$ tend to be small together or large together. Here is one
indication of this:

Lemma 2.7.9. For any $1 \leq k < i \leq n$, one has

$$\operatorname{dist}(X_i, V_i) \geq \frac{\|\pi_i(X_i)\|}{1 + \sum_{j=1}^k \frac{\|\pi_i(X_j)\|}{\operatorname{dist}(X_j, V_j)}},$$

where $\pi_i$ is the orthogonal projection onto the space spanned by $X_1,\ldots,X_k,X_i$.

Proof. We may relabel so that $i = k+1$; then projecting everything
by $\pi_i$ we may assume that $n = k+1$. Our goal is now to show that

$$\operatorname{dist}(X_n, V_{n-1}) \geq \frac{\|X_n\|}{1 + \sum_{j=1}^{n-1} \frac{\|X_j\|}{\operatorname{dist}(X_j, V_j)}}.$$

Recall that $R_1,\ldots,R_n$ is a dual basis to $X_1,\ldots,X_n$. This implies in
particular that

$$x = \sum_{j=1}^n (x \cdot X_j) R_j$$

for any vector $x$; applying this to $X_n$ we obtain

$$X_n = \|X_n\|^2 R_n + \sum_{j=1}^{n-1} (X_j \cdot X_n) R_j$$

and hence by the triangle inequality and the Cauchy-Schwarz inequality

$$\|X_n\|^2 \|R_n\| \leq \|X_n\| + \sum_{j=1}^{n-1} \|X_j\| \|X_n\| \|R_j\|.$$

Dividing by $\|X_n\|^2$ and using the fact that $\|R_j\| = 1/\operatorname{dist}(X_j, V_j)$, the
claim follows. □

In practice, once $k$ gets moderately large (e.g. $k = n^c$ for some
small $c > 0$), one can control the expressions $\|\pi_i(X_j)\|$ appearing here
by Talagrand's inequality (Theorem 2.1.13), and so this inequality
tells us that once $\operatorname{dist}(X_j, V_j)$ is bounded away from zero for
$j = 1,\ldots,k$, it is bounded away from zero for all other $i$ also. This
turns out to be enough to get enough uniform control on the $R_j$ to make
Exercise 2.7.6 useful, and ultimately to complete the proof of Theorem
2.7.8.
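The reduced case $n = k+1$ treated in the proof of Lemma 2.7.9 is easy to test numerically. The sketch below checks the inequality $\operatorname{dist}(X_n,V_n) \geq \|X_n\| / (1 + \sum_{j<n} \|X_j\|/\operatorname{dist}(X_j,V_j))$, where $V_j$ is the span of all rows other than $X_j$, computing every distance through the dual basis identity $\|R_j\| = 1/\operatorname{dist}(X_j,V_j)$. NumPy, the Gaussian test matrices and the function names are assumptions made for illustration.

```python
import numpy as np

def distances_to_other_rows(M):
    """dist(X_j, V_j) for every row X_j of M, where V_j is the span of the other rows.
    Uses ||R_j|| = 1/dist(X_j, V_j), with R_j the columns of M^{-1}."""
    R = np.linalg.inv(M)
    return 1.0 / np.linalg.norm(R, axis=0)

def lemma_lower_bound(M):
    """Lower bound for the last row in the case n = k + 1:
    ||X_n|| / (1 + sum_{j<n} ||X_j|| / dist(X_j, V_j))."""
    d = distances_to_other_rows(M)
    norms = np.linalg.norm(M, axis=1)
    return norms[-1] / (1.0 + np.sum(norms[:-1] / d[:-1]))

rng = np.random.default_rng(0)
for _ in range(5):
    M = rng.standard_normal((6, 6))
    d_last = distances_to_other_rows(M)[-1]
    print(d_last, ">=", lemma_lower_bound(M))
```

The bound is typically far from sharp for a single matrix; its role in the argument is only to propagate a lower bound on the first $k$ distances to the remaining ones.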



2.8. The circular law 

In this section, we leave the realm of self-adjoint matrix ensembles, 
such as Wigner random matrices, and consider instead the simplest 
examples of non-self-adjoint ensembles, namely the iid matrix ensem- 
bles. 

The basic result in this area is

Theorem 2.8.1 (Circular law). Let $M_n$ be an $n \times n$ iid matrix, whose
entries $\xi_{ij}$, $1 \leq i,j \leq n$ are iid with a fixed (complex) distribution
$\xi_{ij} \equiv \xi$ of mean zero and variance one. Then the spectral measure
$\mu_{\frac{1}{\sqrt{n}}M_n}$ converges both in probability and almost surely to the circular
law $\mu_{circ} := \frac{1}{\pi} 1_{|x|^2+|y|^2 \leq 1}\,dx\,dy$, where $x,y$ are the real and imaginary
coordinates of the complex plane.
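A quick simulation illustrates the statement of the circular law: since $\mu_{circ}$ is uniform on the unit disk, the fraction of eigenvalues of $\frac{1}{\sqrt{n}}M_n$ in the disk of radius $r \leq 1$ should be close to $r^2$ for large $n$. The sketch below uses random sign (Bernoulli) entries; the matrix size, the seed and the function names are illustrative assumptions, and NumPy is assumed.

```python
import numpy as np

def scaled_eigenvalues(n, rng):
    """Eigenvalues of M_n / sqrt(n) for an n x n matrix of iid random signs."""
    M = rng.choice([-1.0, 1.0], size=(n, n))
    return np.linalg.eigvals(M / np.sqrt(n))

def fraction_within(eigs, r):
    """Fraction of the eigenvalues lying in the closed disk of radius r."""
    return float(np.mean(np.abs(eigs) <= r))

rng = np.random.default_rng(2010)
eigs = scaled_eigenvalues(400, rng)
for r in (0.25, 0.5, 0.75, 1.0):
    # Empirical fraction next to the circular law prediction min(r^2, 1).
    print(r, fraction_within(eigs, r), min(r * r, 1.0))
```

For real matrices a visible share of the eigenvalues lies exactly on the real axis at moderate $n$, so the agreement with $r^2$ is only approximate at this size.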

This theorem has a long history; it is analogous to the semicircular
law, but the non-Hermitian nature of the matrices makes the
spectrum so unstable that key techniques that are used in the
semicircular case, such as truncation and the moment method, no longer
work; significant new ideas are required. In the case of random
Gaussian matrices, this result was established by Mehta [Me2004] (in the
complex case) and by Edelman [Ed1996] (in the real case), as was
sketched out in Section 2.6. In 1984, Girko [Gi1984] laid out a general
strategy for establishing the result for non-gaussian matrices, which
formed the base of all future work on the subject; however, a key
ingredient in the argument, namely a bound on the least singular value
of shifts $\frac{1}{\sqrt{n}}M_n - zI$, was not fully justified at the time. A rigorous
proof of the circular law was then established by Bai [Ba1997], assuming
additional moment and boundedness conditions on the individual
entries. These additional conditions were then slowly removed in a
sequence of papers [GoTi2007, Gi2004, PaZh2010, TaVu2008],
with the last moment condition being removed in [TaVuKr2010].

At present, the known methods used to establish the circular law 
for general ensembles rely very heavily on the joint independence of 
all the entries. It is a key challenge to see how to weaken this joint 
independence assumption. 



2.8.1. Spectral instability. One of the basic difficulties present in
the non-Hermitian case is spectral instability: small perturbations in
a large matrix can lead to large fluctuations in the spectrum. In
order for any sort of analytic technique to be effective, this type of
instability must somehow be precluded.

The canonical example of spectral instability comes from perturbing
the right shift matrix

$$U_0 := \begin{pmatrix} 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 0 \end{pmatrix}$$

to the matrix

$$U_\varepsilon := \begin{pmatrix} 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \varepsilon & 0 & 0 & \ldots & 0 \end{pmatrix}$$

for some $\varepsilon > 0$.



The matrix $U_0$ is nilpotent: $U_0^n = 0$. Its characteristic polynomial
is $(-\lambda)^n$, and it thus has $n$ repeated eigenvalues at the origin. In
contrast, $U_\varepsilon$ obeys the equation $U_\varepsilon^n = \varepsilon I$, its characteristic polynomial
is $(-\lambda)^n - \varepsilon(-1)^n$, and it thus has $n$ eigenvalues at the $n^{th}$ roots
$\varepsilon^{1/n} e^{2\pi i j/n}$, $j = 0,\ldots,n-1$ of $\varepsilon$. Thus, even for exponentially small
values of $\varepsilon$, say $\varepsilon = 2^{-n}$, the eigenvalues for $U_\varepsilon$ can be quite far from
the eigenvalues of $U_0$, and can wander all over the unit disk. This is in
sharp contrast with the Hermitian case, where eigenvalue inequalities
such as the Weyl inequalities (1.64) or Wielandt-Hoffman inequalities
(1.68) ensure stability of the spectrum.
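The eigenvalue claim for the perturbed shift can be verified directly, without calling an eigenvalue solver (which would itself be subject to the instability under discussion). The sketch below checks that $v = (1, \lambda, \ldots, \lambda^{n-1})$ is an eigenvector with eigenvalue $\lambda = \varepsilon^{1/n} e^{2\pi i j/n}$ for each $j$, and that with $\varepsilon = 2^{-n}$ all eigenvalues have modulus about $1/2$. The function names and the choice $n = 16$ are ours; only the Python standard library is used.

```python
import cmath

def apply_perturbed_shift(v, eps):
    """Return U_eps v, where U_eps has ones on the superdiagonal
    and eps in the bottom-left corner."""
    n = len(v)
    return [v[i + 1] for i in range(n - 1)] + [eps * v[0]]

def eigenpairs(n, eps):
    """Candidate eigenpairs (lam_j, v_j) with lam_j = eps^(1/n) e^{2 pi i j / n}
    and v_j = (1, lam_j, ..., lam_j^(n-1))."""
    r = eps ** (1.0 / n)
    pairs = []
    for j in range(n):
        lam = r * cmath.exp(2j * cmath.pi * j / n)
        pairs.append((lam, [lam ** m for m in range(n)]))
    return pairs

def residual(n, eps):
    """Largest entry of U_eps v - lam v over all candidate eigenpairs."""
    worst = 0.0
    for lam, v in eigenpairs(n, eps):
        w = apply_perturbed_shift(v, eps)
        worst = max(worst, max(abs(a - lam * b) for a, b in zip(w, v)))
    return worst

n = 16
eps = 2.0 ** (-n)
print(abs(eigenpairs(n, eps)[0][0]))   # common modulus eps**(1/n), about 0.5
print(residual(n, eps))                # of the order of rounding error
```

A perturbation of size $2^{-16} \approx 1.5 \times 10^{-5}$ thus moves all sixteen eigenvalues from the origin to the circle of radius $1/2$.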

One can explain the problem in terms of pseudospectrum^{58}. The
only spectrum of $U_0$ is at the origin, so the resolvents $(U_0 - zI)^{-1}$ of
$U_0$ are finite for all non-zero $z$. However, while these resolvents are
finite, they can be extremely large. Indeed, from the nilpotent nature
of $U_0$ we have the Neumann series

$$(U_0 - zI)^{-1} = -\frac{1}{z} - \frac{U_0}{z^2} - \ldots - \frac{U_0^{n-1}}{z^n}$$

so for $|z| < 1$ we see that the resolvent has size roughly $|z|^{-n}$, which is
exponentially large in the interior of the unit disk. This exponentially
large size of resolvent is consistent with the exponential instability of
the spectrum:

^{58} The pseudospectrum of an operator $T$ is the set of complex numbers $z$ for which
the operator norm $\|(T - zI)^{-1}\|_{op}$ is either infinite, or larger than a fixed threshold
$1/\varepsilon$.

Exercise 2.8.1. Let $M$ be a square matrix, and let $z$ be a complex
number. Show that $\|(M - zI)^{-1}\|_{op} \geq R$ if and only if there exists a
perturbation $M + E$ of $M$ with $\|E\|_{op} \leq 1/R$ such that $M + E$ has $z$
as an eigenvalue.

This already hints strongly that if one wants to rigorously prove
control on the spectrum of $M$ near $z$, one needs some sort of upper
bound on $\|(M - zI)^{-1}\|_{op}$, or equivalently one needs some sort of
lower bound on the least singular value $\sigma_n(M - zI)$ of $M - zI$.
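One direction of Exercise 2.8.1 can be made concrete with the singular value decomposition: if $M - zI = \sum_j \sigma_j u_j v_j^*$, then $E := -\sigma_n u_n v_n^*$ has operator norm $\sigma_n = 1/\|(M - zI)^{-1}\|_{op}$ and makes $M + E - zI$ singular. The sketch below, which assumes NumPy and uses illustrative function names, applies this to the shift matrix $U_0$ at $z = 1/2$, where the resolvent norm is at least $|z|^{-n}$.

```python
import numpy as np

def shift(n):
    """The n x n right shift matrix U_0 (ones on the superdiagonal)."""
    return np.diag(np.ones(n - 1), k=1)

def minimal_perturbation(M, z):
    """Return (E, R) with R = ||(M - zI)^{-1}||_op and ||E||_op = 1/R,
    chosen so that M + E has z as an eigenvalue."""
    n = M.shape[0]
    U, s, Vh = np.linalg.svd(M - z * np.eye(n))
    # Subtract the rank-one piece carrying the smallest singular value.
    E = -s[-1] * np.outer(U[:, -1], Vh[-1, :])
    return E, 1.0 / s[-1]

n, z = 12, 0.5
M = shift(n)
E, R = minimal_perturbation(M, z)
print("resolvent norm:", R, " |z|^-n:", abs(z) ** (-n))
print("size of perturbation:", np.linalg.norm(E, 2))
print("least singular value of M + E - zI:",
      np.linalg.svd(M + E - z * np.eye(n), compute_uv=False)[-1])
```

The perturbation found this way is the smallest possible in operator norm, which is why lower bounds on $\sigma_n(M - zI)$ are exactly what is needed to rule out eigenvalues near $z$ after small perturbations.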

Without such a bound, though, the instability precludes the di- 
rect use of the truncation method, which was so useful in the Her- 
mitian case. In particular, there is no obvious way to reduce the 
proof of the circular law to the case of bounded coefficients, in con- 
trast to the semicircular law where this reduction follows easily from 
the Wielandt-Hoffman inequality (see Section 2.4). Instead, we must 
continue working with unbounded random variables throughout the 
argument (unless, of course, one makes an additional decay hypothe- 
sis, such as assuming certain moments are finite; this helps explain the 
presence of such moment conditions in many papers on the circular 
law). 

2.8.2. Incompleteness of the moment method. In the Hermitian
case, the moments

$$\frac{1}{n} \operatorname{tr} \left( \frac{1}{\sqrt{n}} M_n \right)^k = \int_{\mathbf{R}} x^k\,d\mu_{\frac{1}{\sqrt{n}}M_n}(x)$$

of a matrix can be used (in principle) to understand the distribution
$\mu_{\frac{1}{\sqrt{n}}M_n}$ completely (at least, when the measure $\mu_{\frac{1}{\sqrt{n}}M_n}$ has sufficient
decay at infinity). This is ultimately because the space of real
polynomials $P(x)$ is dense in various function spaces (the Weierstrass
approximation theorem).

In the non-Hermitian case, the spectral measure $\mu_{\frac{1}{\sqrt{n}}M_n}$ is now
supported on the complex plane rather than the real line. One still
has the formula

$$\frac{1}{n} \operatorname{tr} \left( \frac{1}{\sqrt{n}} M_n \right)^k = \int_{\mathbf{C}} z^k\,d\mu_{\frac{1}{\sqrt{n}}M_n}(z)$$

but it is much less useful now, because the space of complex
polynomials $P(z)$ no longer has any good density properties^{59}. In particular,
the moments no longer uniquely determine the spectral measure.

This can be illustrated with the shift examples given above. It is
easy to see that $U_0$ and $U_\varepsilon$ have vanishing moments up to $(n-1)^{th}$
order, i.e.

$$\frac{1}{n} \operatorname{tr} U_0^k = \frac{1}{n} \operatorname{tr} U_\varepsilon^k = 0$$

for $k = 1,\ldots,n-1$. Despite this enormous number of matching
moments, the spectral measures $\mu_{U_0}$ and $\mu_{U_\varepsilon}$ are vastly different;
the former is a Dirac mass at the origin, while the latter can be
arbitrarily close to the unit circle. Indeed, even if we set all moments
equal to zero,

$$\int_{\mathbf{C}} z^k\,d\mu = 0$$

for $k = 1,2,\ldots$, then there are an uncountable number of possible
(continuous) probability measures that could still be the (asymptotic)
spectral measure $\mu$: for instance, any measure which is rotationally
symmetric around the origin would obey these conditions.
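The vanishing of the moments of the shift matrices is easy to confirm by direct computation. The following sketch (standard library only; the function names are ours) forms the powers of the perturbed shift $U_\varepsilon$ and records their traces: they vanish for $k = 1,\ldots,n-1$ for every $\varepsilon$, and the first moment that distinguishes $U_\varepsilon$ from $U_0$ is the $n^{th}$ one, where $\operatorname{tr} U_\varepsilon^n = n\varepsilon$.

```python
def perturbed_shift(n, eps):
    """U_eps as a list of rows: ones on the superdiagonal, eps in the bottom-left corner."""
    U = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        U[i][i + 1] = 1.0
    U[n - 1][0] = eps
    return U

def matmul(A, B):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def power_traces(U, kmax):
    """Traces tr(U^k) for k = 1, ..., kmax."""
    traces, P = [], U
    for _ in range(kmax):
        traces.append(sum(P[i][i] for i in range(len(U))))
        P = matmul(P, U)
    return traces

n = 8
print(power_traces(perturbed_shift(n, 0.0), n))    # all traces vanish for U_0
print(power_traces(perturbed_shift(n, 0.25), n))   # only the n-th trace is non-zero
```

With $\varepsilon = 0.25$ and $n = 8$ all entries involved are exactly representable in floating point, so the comparison is exact rather than approximate.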

If one could somehow control the mixed moments

$$\int_{\mathbf{C}} z^k \overline{z}^l\,d\mu_{\frac{1}{\sqrt{n}}M_n}(z) = \frac{1}{n} \sum_{j=1}^n \left( \frac{1}{\sqrt{n}} \lambda_j(M_n) \right)^k \left( \frac{1}{\sqrt{n}} \overline{\lambda_j(M_n)} \right)^l$$

of the spectral measure, then this problem would be resolved, and one
could use the moment method to reconstruct the spectral measure
accurately. However, there does not appear to be any obvious way
to compute this quantity; the obvious guess of
$\frac{1}{n} \operatorname{tr} (\frac{1}{\sqrt{n}}M_n)^k (\frac{1}{\sqrt{n}}M_n^*)^l$
works when the matrix $M_n$ is normal, as $M_n$ and $M_n^*$ then share the
same basis of eigenvectors, but generically one does not expect these
matrices to be normal.

^{59} For instance, the uniform closure of the space of polynomials on the unit disk
is not the space of continuous functions, but rather the space of holomorphic functions
that are continuous on the closed unit disk.

Remark 2.8.2. The failure of the moment method to control the 
spectral measure is consistent with the instability of spectral mea- 
sure with respect to perturbations, because moments are stable with 
respect to perturbations. 

Exercise 2.8.2. Let $k \geq 1$ be an integer, and let $M_n$ be an iid matrix
whose entries have a fixed distribution $\xi$ with mean zero, variance 1,
and with $k^{th}$ moment finite. Show that $\frac{1}{n} \operatorname{tr} (\frac{1}{\sqrt{n}}M_n)^k$ converges to
zero as $n \to \infty$ in expectation, in probability, and in the almost sure
sense. Thus we see that $\int_{\mathbf{C}} z^k\,d\mu_{\frac{1}{\sqrt{n}}M_n}(z)$ converges to zero in these
three senses also. This is of course consistent with the circular law,
but does not come close to establishing that law, for the reasons given
above.

The failure of the moment method also shows that methods of free
probability (Section 2.5) do not work directly. For instance, observe
that for fixed $\varepsilon$, $U_0$ and $U_\varepsilon$ (in the noncommutative probability space
$(\operatorname{Mat}_n(\mathbf{C}), \frac{1}{n} \operatorname{tr})$) both converge in the sense of *-moments as $n \to \infty$
to that of the right shift operator on $\ell^2(\mathbf{Z})$ (with the trace
$\tau(T) = \langle e_0, T e_0 \rangle$, with $e_0$ being the Kronecker delta at 0); but the spectral
measures of $U_0$ and $U_\varepsilon$ are different. Thus the spectral measure cannot
be read off directly from the free probability limit.

2.8.3. The logarithmic potential. With the moment method out
of consideration, attention naturally turns to the Stieltjes transform

$$s_n(z) = \frac{1}{n} \operatorname{tr} \left( \frac{1}{\sqrt{n}} M_n - zI \right)^{-1} = \int_{\mathbf{C}} \frac{d\mu_{\frac{1}{\sqrt{n}}M_n}(w)}{w - z}.$$

Even though the measure $\mu_{\frac{1}{\sqrt{n}}M_n}$ is now supported on $\mathbf{C}$ rather than
$\mathbf{R}$, the Stieltjes transform is still well-defined. The Plemelj formula
for reconstructing spectral measure from the Stieltjes transform that
was used in previous sections is no longer applicable, but there are
other formulae one can use instead, in particular one has

Exercise 2.8.3. Show that

$$\mu_{\frac{1}{\sqrt{n}}M_n} = -\frac{1}{\pi} \partial_{\bar{z}} s_n(z)$$

in the sense of distributions, where

$$\partial_{\bar{z}} := \frac{1}{2} \left( \frac{\partial}{\partial x} + i \frac{\partial}{\partial y} \right)$$

is the Cauchy-Riemann operator.

One can control the Stieltjes transform quite effectively away from
the origin. Indeed, for iid matrices with subgaussian entries, one can
show (using the methods from Section 2.3) that the operator norm
of $\frac{1}{\sqrt{n}}M_n$ is $1 + o(1)$ almost surely; this, combined with (2.8.2) and
Laurent expansion, tells us that $s_n(z)$ almost surely converges to $-1/z$
locally uniformly in the region $\{z : |z| > 1\}$, and that the spectral
measure $\mu_{\frac{1}{\sqrt{n}}M_n}$ converges almost surely to zero in this region (which
can of course also be deduced directly from the operator norm bound).
This is of course consistent with the circular law, but is not sufficient
to prove it (for instance, the above information is also consistent
with the scenario in which the spectral measure collapses towards the
origin). One also needs to control the Stieltjes transform inside the
disk $\{z : |z| \leq 1\}$ in order to fully control the spectral measure.

For this, existing methods (such as predecessor comparison) are
not particularly effective (mainly because of the spectral instability
and also because of the lack of analyticity in the interior of the
spectrum). Instead, one proceeds by relating the Stieltjes transform to
the logarithmic potential

$$f_n(z) := \int_{\mathbf{C}} \log|w - z|\,d\mu_{\frac{1}{\sqrt{n}}M_n}(w).$$

It is easy to see that $s_n(z)$ is essentially the (distributional) gradient
of $f_n(z)$:

$$s_n(z) = \left( -\frac{\partial}{\partial x} + i \frac{\partial}{\partial y} \right) f_n(z),$$

and thus $f_n$ is related to the spectral measure by the distributional
formula^{60}

$$\mu_{\frac{1}{\sqrt{n}}M_n} = \frac{1}{2\pi} \Delta f_n \tag{2.179}$$

where $\Delta := \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}$ is the Laplacian.

^{60} This formula just reflects the fact that $\frac{1}{2\pi} \log|z|$ is the Newtonian potential in
two dimensions.

In analogy to previous continuity theorems, we have

Theorem 2.8.3 (Logarithmic potential continuity theorem). Let $M_n$
be a sequence of random matrices, and suppose that for almost every
complex number $z$, $f_n(z)$ converges almost surely (resp. in probability)
to

$$f(z) := \int_{\mathbf{C}} \log|z - w|\,d\mu(w)$$

for some probability measure $\mu$. Then $\mu_{\frac{1}{\sqrt{n}}M_n}$ converges almost surely
(resp. in probability) to $\mu$ in the vague topology.

Proof. We prove the almost sure version of this theorem, and leave
the convergence in probability version as an exercise.

On any bounded set $K$ in the complex plane, the functions
$\log|\cdot - w|$ lie in $L^2(K)$ uniformly in $w$. From Minkowski's integral
inequality, we conclude that the $f_n$ and $f$ are uniformly bounded in
$L^2(K)$. On the other hand, almost surely the $f_n$ converge pointwise
to $f$. From the dominated convergence theorem this implies that
$\min(|f_n - f|, M)$ converges in $L^1(K)$ to zero for any $M$; using the
uniform bound in $L^2(K)$ to compare $\min(|f_n - f|, M)$ with $|f_n - f|$
and then sending $M \to \infty$, we conclude that $f_n$ converges to $f$ in
$L^1(K)$. In particular, $f_n$ converges to $f$ in the sense of distributions;
taking distributional Laplacians using (2.179) we obtain the
claim. □

Exercise 2.8.4. Establish the convergence in probability version of 
Theorem 2.8.3. 

Thus, the task of establishing the circular law then reduces to
showing, for almost every $z$, that the logarithmic potential $f_n(z)$
converges (in probability or almost surely) to the right limit $f(z)$.

Observe that the logarithmic potential

$$f_n(z) = \frac{1}{n} \sum_{j=1}^n \log\left| \frac{\lambda_j(M_n)}{\sqrt{n}} - z \right|$$

can be rewritten as a log-determinant:

$$f_n(z) = \frac{1}{n} \log\left| \det\left( \frac{1}{\sqrt{n}} M_n - zI \right) \right|.$$

To compute this determinant, we recall that the determinant of a
matrix $A$ is not only the product of its eigenvalues, but also has a
magnitude equal to the product of its singular values:

$$|\det A| = \prod_{j=1}^n \sigma_j(A) = \prod_{j=1}^n \lambda_j(A^*A)^{1/2}$$

and thus

$$f_n(z) = \frac{1}{2} \int_0^\infty \log x\,d\nu_{n,z}(x)$$

where $d\nu_{n,z}$ is the spectral measure of the matrix
$(\frac{1}{\sqrt{n}}M_n - zI)^*(\frac{1}{\sqrt{n}}M_n - zI)$.

The advantage of working with this spectral measure, as opposed
to the original spectral measure $\mu_{\frac{1}{\sqrt{n}}M_n}$, is that the matrix
$(\frac{1}{\sqrt{n}}M_n - zI)^*(\frac{1}{\sqrt{n}}M_n - zI)$ is self-adjoint, and so methods such as the moment
method or free probability can now be safely applied to compute
the limiting spectral distribution. Indeed, Girko [Gi1984] established
that for almost every $z$, $\nu_{n,z}$ converged both in probability and almost
surely to an explicit (though slightly complicated) limiting measure
$\nu_z$ in the vague topology. Formally, this implied that $f_n(z)$ would
converge pointwise (almost surely and in probability) to

$$\frac{1}{2} \int_0^\infty \log x\,d\nu_z(x).$$

A lengthy but straightforward computation then showed that this
expression was indeed the logarithmic potential $f(z)$ of the circular
measure $\mu_{circ}$, so that the circular law would then follow from the
logarithmic potential continuity theorem.
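The three expressions for the logarithmic potential (via eigenvalues, via the log-determinant, and via singular values) can be compared numerically. The sketch below does this for a real Gaussian matrix and also compares the result with the logarithmic potential of the circular law, for which we use the standard closed form $\frac{1}{2}(|z|^2 - 1)$ when $|z| \leq 1$ and $\log|z|$ when $|z| > 1$; this closed form is not derived in the text and is quoted here as an assumption of the sketch, as are the matrix size, the test point $z$, the function names and the use of NumPy.

```python
import numpy as np

def log_potential_three_ways(M, z):
    """f_n(z) for M/sqrt(n), computed from eigenvalues, from the
    log-determinant, and from the singular values of M/sqrt(n) - zI."""
    n = M.shape[0]
    A = M / np.sqrt(n) - z * np.eye(n)
    eigs = np.linalg.eigvals(M / np.sqrt(n))
    via_eigs = float(np.mean(np.log(np.abs(eigs - z))))
    _, logabsdet = np.linalg.slogdet(A)
    via_det = float(logabsdet) / n
    sing = np.linalg.svd(A, compute_uv=False)
    # nu_{n,z} puts mass 1/n on each squared singular value.
    via_sing = 0.5 * float(np.mean(np.log(sing ** 2)))
    return via_eigs, via_det, via_sing

def circular_potential(z):
    """Logarithmic potential of the uniform measure on the unit disk."""
    r = abs(z)
    return 0.5 * (r * r - 1.0) if r <= 1.0 else float(np.log(r))

rng = np.random.default_rng(0)
M = rng.standard_normal((300, 300))
z = 0.3 + 0.4j
print(log_potential_three_ways(M, z))
print(circular_potential(z))
```

The first three numbers agree to rounding error for any matrix; only their closeness to the fourth reflects the circular law, and at this matrix size that closeness is approximate.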

Unfortunately, the vague convergence of $\nu_{n,z}$ to $\nu_z$ only allows
one to deduce the convergence of $\int_0^\infty F(x)\,d\nu_{n,z}$ to $\int_0^\infty F(x)\,d\nu_z$ for
$F$ continuous and compactly supported. However, $\log x$ has
singularities at zero and at infinity, and so the convergence

$$\int_0^\infty \log x\,d\nu_{n,z}(x) \to \int_0^\infty \log x\,d\nu_z(x)$$

can fail if the spectral measure $\nu_{n,z}$ sends too much of its mass to
zero or to infinity.






The latter scenario can be easily excluded, either by using oper- 
ator norm bounds on M n (when one has enough moment conditions) 
or even just the Frobenius norm bounds (which require no moment 
conditions beyond the unit variance). The real difficulty is with pre- 
venting mass from going to the origin. 

The approach of Bai [Ba1997] proceeded in two steps. Firstly, he
established a polynomial lower bound

$$\sigma_n\left( \frac{1}{\sqrt{n}} M_n - zI \right) \geq n^{-C}$$

asymptotically almost surely for the least singular value of
$\frac{1}{\sqrt{n}}M_n - zI$. This has the effect of capping off the $\log x$ integrand to be
of size $O(\log n)$. Next, by using Stieltjes transform methods, the
convergence of $\nu_{n,z}$ to $\nu_z$ in an appropriate metric (e.g. the Lévy
distance metric) was shown to be polynomially fast, so that the
distance decayed like $O(n^{-c})$ for some $c > 0$. The $O(n^{-c})$ gain
can safely absorb the $O(\log n)$ loss, and this leads to a proof of
the circular law assuming enough boundedness and continuity
hypotheses to ensure the least singular value bound and the
convergence rate. This basic paradigm was also followed by later works
[GoTi2007, PaZh2010, TaVu2008], with the main new ingredient
being the advances in the understanding of the least singular value
(Section 2.7).

Unfortunately, to get the polynomial convergence rate, one needs
some moment conditions beyond zero mean and unit variance
(e.g. finite $(2+\eta)^{th}$ moment for some $\eta > 0$). In my paper with
Vu and Krishnapur, we used the additional tool of the Talagrand
concentration inequality (Theorem 2.1.13) to eliminate the need for
the polynomial convergence. Intuitively, the point is that only a small
fraction of the singular values of $\frac{1}{\sqrt{n}}M_n - zI$ are going to be as small as
$n^{-c}$; most will be much larger than this, and so the $O(\log n)$ bound is
only going to be needed for a small fraction of the measure. To make
this rigorous, it turns out to be convenient to work with a slightly
different formula for the determinant magnitude $|\det(A)|$ of a square
matrix than the product of the eigenvalues, namely the
base-times-height formula

$$|\det(A)| = \prod_{j=1}^n \operatorname{dist}(X_j, V_j)$$

where $X_j$ is the $j^{th}$ row and $V_j$ is the span of $X_1,\ldots,X_{j-1}$.
Exercise 2.8.5. Establish the inequality

$$\prod_{j=n+1-m}^n \sigma_j(A) \leq \prod_{j=1}^m \operatorname{dist}(X_j, V_j) \leq \prod_{j=1}^m \sigma_j(A)$$

for any $1 \leq m \leq n$. (Hint: the middle product is the product of
the singular values of the first $m$ rows of $A$, and so one should try to
use the Cauchy interlacing inequality for singular values, see Section
1.3.3.) Thus we see that $\operatorname{dist}(X_j, V_j)$ is a variant of $\sigma_j(A)$.
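Both the base-times-height formula and the inequality of Exercise 2.8.5 lend themselves to a numerical check. In the sketch below the distances $\operatorname{dist}(X_j, V_j)$, with $V_j$ the span of the first $j-1$ rows, are read off from the diagonal of the triangular factor in a QR factorisation of the transpose of $A$; this is one convenient way to compute them, not a method taken from the text. NumPy, the test matrix and the function names are assumptions.

```python
import numpy as np

def row_distances(A):
    """dist(X_j, V_j) for each row X_j of A, where V_j is the span of the earlier rows.
    In the QR factorisation of A^T, |R[j, j]| is the distance of the j-th column of A^T
    (the j-th row of A) from the span of the preceding columns."""
    _, R = np.linalg.qr(A.T)
    return np.abs(np.diag(R))

def interlacing_products(A, m):
    """The three products in Exercise 2.8.5: the m smallest singular values,
    the first m distances, and the m largest singular values."""
    s = np.linalg.svd(A, compute_uv=False)    # sorted in decreasing order
    d = row_distances(A)
    return float(np.prod(s[-m:])), float(np.prod(d[:m])), float(np.prod(s[:m]))

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
print(np.prod(row_distances(A)), abs(np.linalg.det(A)))   # base-times-height formula
for m in range(1, 7):
    print(m, interlacing_products(A, m))
```

For $m = n$ all three products coincide with $|\det(A)|$, which recovers the base-times-height formula as the extreme case of the exercise.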

The least singular value bounds, translated in this language (with
$A := \frac{1}{\sqrt{n}}M_n - zI$), tell us that $\operatorname{dist}(X_j, V_j) \geq n^{-C}$ with high
probability; this lets us ignore the most dangerous values of $j$, namely those
$j$ that are equal to $n - O(n^{0.99})$ (say). For low values of $j$, say
$j \leq (1-\delta)n$ for some small $\delta$, one can use the moment method
to get a good lower bound for the distances and the singular values,
to the extent that the logarithmic singularity of $\log x$ no longer causes
difficulty in this regime; the limit of this contribution can then be seen
by moment method or Stieltjes transform techniques to be universal
in the sense that it does not depend on the precise distribution of the
components of $M_n$. In the medium regime $(1-\delta)n < j < n - n^{0.99}$,
one can use Talagrand's inequality (Theorem 2.1.13) to show that
$\operatorname{dist}(X_j, V_j)$ has magnitude about $\sqrt{n-j}$, giving rise to a net
contribution to $f_n(z)$ of the form $\frac{1}{n} \sum_{(1-\delta)n < j < n - n^{0.99}} O(\log \sqrt{n-j})$,
which is small. Putting all this together, one can show that $f_n(z)$
converges to a universal limit as $n \to \infty$ (independent of the
component distributions); see [TaVuKr2010] for details. As a consequence,
once the circular law is established for one class of iid matrices, such
as the complex Gaussian random matrix ensemble, it automatically
holds for all other ensembles also.

2.8.4. Brown measure. We mentioned earlier that due to eigenvalue
instability (or equivalently, due to the least singular value of
shifts possibly going to zero), the moment method (and thus, by
extension, free probability) was not sufficient by itself to compute the
asymptotic spectral measure of non-Hermitian matrices in the large $n$
limit. However, this method can be used to give a heuristic prediction
as to what that measure is, known as the Brown measure [Br1986].
While Brown measure is not always the limiting spectral measure of
a sequence of matrices, it turns out in practice that this measure can
(with some effort) be shown to be the limiting spectral measure in
key cases. As Brown measure can be computed (again, after some
effort) in many cases, this gives a general strategy towards computing
asymptotic spectral measure for various ensembles.

To define Brown measure, we use the language of free probability
(Section 2.5). Let $u$ be a bounded element (not necessarily
self-adjoint) of a non-commutative probability space $(\mathcal{A}, \tau)$, which we will
assume to be tracial. To derive Brown measure, we mimic the Girko
strategy used for the circular law. Firstly, for each complex number
$z$, we let $\nu_z$ be the spectral measure of the non-negative self-adjoint
element $(u - z)^*(u - z)$.

Exercise 2.8.6. Verify that the spectral measure of a positive
element $u^*u$ is automatically supported on the non-negative real axis.
(Hint: Show that $\tau(P(u^*u) u^*u P(u^*u)) \geq 0$ for any real polynomial
$P$, and use the spectral theorem.)

By the above exercise, $\nu_z$ is a compactly supported probability
measure on $[0, +\infty)$. We then define the logarithmic potential $f(z)$
by the formula

$$f(z) = \frac{1}{2} \int_0^\infty \log x\,d\nu_z(x).$$

Note that $f$ may equal $-\infty$ at some points.

To understand this determinant, we introduce the regularised
determinant

$$f_\varepsilon(z) := \frac{1}{2} \int_0^\infty \log(\varepsilon + x)\,d\nu_z(x)$$

for $\varepsilon > 0$. From the monotone convergence theorem we see that $f_\varepsilon(z)$
decreases pointwise to $f(z)$ as $\varepsilon \to 0$.

We now invoke the Gelfand-Naimark theorem (Exercise 2.5.10)
and embed^{61} $\mathcal{A}$ into the space of bounded operators on $L^2(\tau)$, so that
we may now obtain a functional calculus. Then we can write

$$f_\varepsilon(z) = \frac{1}{2} \tau(\log(\varepsilon + (u - z)^*(u - z))).$$

One can compute the first variation of $f_\varepsilon$:

Exercise 2.8.7. Let $\varepsilon > 0$. Show that the function $f_\varepsilon$ is continuously
differentiable with

$$\partial_x f_\varepsilon(z) = -\operatorname{Re} \tau((\varepsilon + (u - z)^*(u - z))^{-1}(u - z))$$

and

$$\partial_y f_\varepsilon(z) = -\operatorname{Im} \tau((\varepsilon + (u - z)^*(u - z))^{-1}(u - z)).$$

Then, one can compute the second variation at, say, the origin:

Exercise 2.8.8. Let $\varepsilon > 0$. Show that the function $f_\varepsilon$ is twice
continuously differentiable with

$$\partial_{xx} f_\varepsilon(0) = \operatorname{Re} \tau((\varepsilon + u^*u)^{-1} - (\varepsilon + u^*u)^{-1}(u + u^*)(\varepsilon + u^*u)^{-1}u)$$

and

$$\partial_{yy} f_\varepsilon(0) = \operatorname{Re} \tau((\varepsilon + u^*u)^{-1} - (\varepsilon + u^*u)^{-1}(u^* - u)(\varepsilon + u^*u)^{-1}u).$$

We conclude in particular that

$$\Delta f_\varepsilon(0) = 2 \operatorname{Re} \tau((\varepsilon + u^*u)^{-1} - (\varepsilon + u^*u)^{-1}u^*(\varepsilon + u^*u)^{-1}u)$$

or equivalently

$$\Delta f_\varepsilon(0) = 2\left( \|(\varepsilon + u^*u)^{-1/2}\|_{L^2(\tau)}^2 - \|(\varepsilon + u^*u)^{-1/2} u (\varepsilon + u^*u)^{-1/2}\|_{L^2(\tau)}^2 \right).$$

Exercise 2.8.9. Show that

$$\|(\varepsilon + u^*u)^{-1/2} u (\varepsilon + u^*u)^{-1/2}\|_{L^2(\tau)} \leq \|(\varepsilon + u^*u)^{-1/2}\|_{L^2(\tau)}.$$

(Hint: Adapt the proof of Lemma 2.5.13.)

^{61} If $\tau$ is not faithful, this embedding need not be injective, but this will not be
an issue in what follows.

We conclude that $\Delta f_\varepsilon$ is non-negative at zero. Translating $u$ by
any complex number we see that $\Delta f_\varepsilon$ is non-negative everywhere,
that is to say that $f_\varepsilon$ is subharmonic. Taking limits we see that $f$ is
subharmonic also; thus if we define the Brown measure $\mu = \mu_u$ of $u$
as

$$\mu := \frac{1}{2\pi} \Delta f$$

(cf. (2.179)) then $\mu$ is a non-negative measure.
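In the finite-dimensional setting $(\operatorname{Mat}_n(\mathbf{C}), \frac{1}{n}\operatorname{tr})$ the second variation computation can be tested numerically. The sketch below compares the closed form $\Delta f_\varepsilon(0) = 2 \operatorname{Re} \tau((\varepsilon + u^*u)^{-1} - (\varepsilon + u^*u)^{-1}u^*(\varepsilon + u^*u)^{-1}u)$ with a five-point finite-difference Laplacian of $f_\varepsilon$, and confirms that the value is non-negative. The random matrix $u$, the value of $\varepsilon$, the step size, the function names and the use of NumPy are all illustrative assumptions.

```python
import numpy as np

def f_eps(u, z, eps):
    """f_eps(z) = (1/2) tau(log(eps + (u - z)^*(u - z))) with tau = (1/n) tr."""
    n = u.shape[0]
    A = u - z * np.eye(n)
    w = np.linalg.eigvalsh(A.conj().T @ A)     # spectrum of (u - z)^*(u - z)
    return 0.5 * float(np.mean(np.log(eps + w)))

def laplacian_formula(u, eps):
    """2 Re tau(R - R u^* R u) with R = (eps + u^* u)^{-1}."""
    n = u.shape[0]
    R = np.linalg.inv(eps * np.eye(n) + u.conj().T @ u)
    return 2.0 * float(np.real(np.trace(R - R @ u.conj().T @ R @ u))) / n

def laplacian_finite_difference(u, eps, h=1e-3):
    """Five-point stencil approximation to the Laplacian of f_eps at the origin."""
    f = lambda z: f_eps(u, z, eps)
    return (f(h) + f(-h) + f(1j * h) + f(-1j * h) - 4.0 * f(0.0)) / h ** 2

rng = np.random.default_rng(0)
n = 6
u = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
print(laplacian_formula(u, 0.5), laplacian_finite_difference(u, 0.5))
```

Since translating $u$ by a complex number reduces the Laplacian at any point to the Laplacian at the origin, checking the origin for a shifted matrix $u - z_0$ covers a general point $z_0$.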

Exercise 2.8.10. Show that for $|z| > \rho(u) := \rho(u^*u)^{1/2}$, $f$ is
continuously differentiable with

$$\partial_x f(z) = -\operatorname{Re} \tau((u - z)^{-1})$$

and

$$\partial_y f(z) = \operatorname{Im} \tau((u - z)^{-1})$$

and conclude that $f$ is harmonic in this region; thus Brown measure
is supported in the disk $\{z : |z| \leq \rho(u)\}$. Using Green's theorem,
conclude also that Brown measure is a probability measure.

Exercise 2.8.11. In a finite-dimensional non-commutative probability
space $(\operatorname{Mat}_n(\mathbf{C}), \frac{1}{n} \operatorname{tr})$, show that Brown measure is the same as
spectral measure.

Exercise 2.8.12. In a commutative probability space $(L^\infty(\Omega), \mathbf{E})$,
show that Brown measure is the same as the probability distribution.

Exercise 2.8.13. If $u$ is the left shift on $\ell^2(\mathbf{Z})$ (with the trace
$\tau(T) := \langle T e_0, e_0 \rangle$), show that the Brown measure of $u$ is the uniform measure
on the unit circle $\{z \in \mathbf{C} : |z| = 1\}$.

This last exercise illustrates the limitations of Brown measure for
understanding asymptotic spectral measure. The shift $U_0$ and the
perturbed shift $U_\varepsilon$ introduced in previous sections both converge in
the sense of *-moments as $n \to \infty$ (holding $\varepsilon$ fixed) to the left shift
$u$. For non-zero $\varepsilon$, the spectral measure of $U_\varepsilon$ does indeed converge
to the Brown measure of $u$, but for $\varepsilon = 0$ this is not the case. This
illustrates a more general principle^{62}, that Brown measure is the right
asymptotic limit for "generic" matrices, but not for exceptional
matrices.

^{62} See [Sn2002] for a precise formulation of this heuristic, using Gaussian
regularisation.

The machinery used to establish the circular law in full generality 
can be used to show that Brown measure is the correct asymptotic 
spectral limit for other models: 

Theorem 2.8.4. Let $M_n$ be a sequence of random matrices whose
entries are jointly independent and with all moments uniformly bounded,
with variance uniformly bounded from below, and which converges in
the sense of *-moments to an element $u$ of a non-commutative
probability space. Then the spectral measure $\mu_{\frac{1}{\sqrt{n}}M_n}$ converges almost
surely and in probability to the Brown measure of $u$.

This theorem is essentially [TaVuKr2010, Theorem 1.20]. The 
main ingredients are those mentioned earlier, namely a polynomial 
lower bound on the least singular value, and the use of Talagrand's 
inequality (Theorem 2.1.13) to control medium singular values (or 
medium codimension distances to subspaces) . Of the two ingredients, 
the former is more crucial, and is much more heavily dependent at 
present on the joint independence hypothesis; it would be of interest 
to see how to obtain lower bounds on the least singular value in more 
general settings.

3.1. Brownian motion and Dyson Brownian motion

One theme in this text will be the central role played by the gaussian
random variables $X \equiv N(\mu, \sigma^2)$. Gaussians have an incredibly
rich algebraic structure, and many results about general random vari- 
ables can be established by first using this structure to verify the re- 
sult for Gaussians, and then using universality techniques (such as the 
Lindeberg exchange strategy) to extend the results to more general 
variables. 

One way to exploit this algebraic structure is to continuously
deform the variance $t := \sigma^2$ from an initial variance of zero (so that
the random variable is deterministic) to some final level $T$. We would
like to use this to give a continuous family $t \mapsto X_t$ of random variables
$X_t \equiv N(\mu, t)$ as $t$ (viewed as a "time" parameter) runs from $0$ to $T$.

At present, we have not completely specified what $X_t$ should
be, because we have only described the individual distribution
$X_t \equiv N(\mu, t)$ of each $X_t$, and not the joint distribution. However, there is
a very natural way to specify a joint distribution of this type, known 
as Brownian motion. In this section we lay the necessary probability 
theory foundations to set up this motion, and indicate its connection 
with the heat equation, the central limit theorem, and the Ornstein- 
Uhlenbeck process. This is the beginning of stochastic calculus, which 
we will not develop fully here. 

We will begin with one-dimensional Brownian motion, but it is 
a simple matter to extend the process to higher dimensions. In par- 
ticular, we can define Brownian motion on vector spaces of matrices, 
such as the space of $n \times n$ Hermitian matrices. This process is
equivariant with respect to conjugation by unitary matrices, and so we
can quotient out by this conjugation and obtain a new process on
the quotient space, or in other words on the spectrum of $n \times n$
Hermitian matrices. This process is called Dyson Brownian motion, and
turns out to have a simple description in terms of ordinary Brownian
motion; it will play a key role in this text.

3.1.1. Formal construction of Brownian motion. We begin with
constructing one-dimensional Brownian motion. We shall model this 
motion using the machinery of Wiener processes: 

Definition 3.1.1 (Wiener process). Let $\mu \in \mathbb{R}$, and let $\Sigma \subset [0, +\infty)$
be a set of times containing $0$. A (one-dimensional) Wiener process
on $\Sigma$ with initial position $\mu$ is a collection $(X_t)_{t \in \Sigma}$ of real random
variables $X_t$ for each time $t \in \Sigma$, with the following properties:

(i) $X_0 = \mu$.

(ii) Almost surely, the map $t \mapsto X_t$ is a continuous function on
$\Sigma$.

(iii) For every $0 \le t_- < t_+$ in $\Sigma$, the increment $X_{t_+} - X_{t_-}$ has the
distribution of $N(0, t_+ - t_-)_{\mathbb{R}}$. (In particular, $X_t \equiv N(\mu, t)_{\mathbb{R}}$
for every $t > 0$.)

(iv) For every $t_0 \le t_1 \le \dots \le t_n$ in $\Sigma$, the increments $X_{t_i} - X_{t_{i-1}}$
for $i = 1, \dots, n$ are jointly independent.

If $\Sigma$ is discrete, we say that $(X_t)_{t \in \Sigma}$ is a discrete Wiener process; if
$\Sigma = [0, +\infty)$ then we say that $(X_t)_{t \in \Sigma}$ is a continuous Wiener process.

Remark 3.1.2. Collections of random variables $(X_t)_{t \in \Sigma}$, where $\Sigma$ is
a set of times, will be referred to as stochastic processes; thus Wiener
processes are a (very) special type of stochastic process.

Remark 3.1.3. In the case of discrete Wiener processes, the continuity
requirement (ii) is automatic. For continuous Wiener processes,
there is a minor technical issue: the event that $t \mapsto X_t$ is continuous
need not be a measurable event (one has to take uncountable
intersections to define this event). Because of this, we interpret (ii)
by saying that there exists a measurable event of probability 1, such
that $t \mapsto X_t$ is continuous on all of this event, while allowing
for the possibility that $t \mapsto X_t$ could sometimes also be continuous
outside of this event. One can view the collection $(X_t)_{t \in \Sigma}$ as a
single random variable, taking values in the product space $\mathbb{R}^\Sigma$ (with
the product $\sigma$-algebra, of course).

Remark 3.1.4. One can clearly normalise the initial position $\mu$ of a
Wiener process to be zero by replacing $X_t$ with $X_t - \mu$ for each $t$.

We shall abuse notation somewhat and identify continuous Wiener
processes with Brownian motion in our informal discussion, although
technically the former is merely a model for the latter. To emphasise
this link with Brownian motion, we shall often denote continuous
Wiener processes as $(B_t)_{t \in [0,+\infty)}$ rather than $(X_t)_{t \in [0,+\infty)}$.

It is not yet obvious that Wiener processes exist, and to what 
extent they are unique. The situation is easily clarified though for 
discrete processes: 

Proposition 3.1.5 (Discrete Brownian motion). Let $\Sigma$ be a discrete
subset of $[0, +\infty)$ containing $0$, and let $\mu \in \mathbb{R}$. Then (after extending
the sample space if necessary) there exists a Wiener process $(X_t)_{t \in \Sigma}$
with base point $\mu$. Furthermore, any other Wiener process $(X'_t)_{t \in \Sigma}$
with base point $\mu$ has the same distribution as $(X_t)_{t \in \Sigma}$.

Proof. As $\Sigma$ is discrete and contains $0$, we can write it as $\{t_0, t_1, t_2, \dots\}$
for some

$0 = t_0 < t_1 < t_2 < \dots$.

Let $(dX_i)_{i=1}^\infty$ be a collection of jointly independent random variables
with $dX_i \equiv N(0, t_i - t_{i-1})_{\mathbb{R}}$ (the existence of such a collection, after
extending the sample space, is guaranteed by Exercise 1.1.20). If we
then set

$X_{t_i} := \mu + dX_1 + \dots + dX_i$

for all $i = 0, 1, 2, \dots$, then one easily verifies (using Exercise 2.1.9)
that $(X_t)_{t \in \Sigma}$ is a Wiener process.

Conversely, if $(X'_t)_{t \in \Sigma}$ is a Wiener process, and we define
$dX'_i := X'_{t_i} - X'_{t_{i-1}}$ for $i = 1, 2, \dots$, then from the definition of a Wiener
process we see that the $dX'_i$ have distribution $N(0, t_i - t_{i-1})_{\mathbb{R}}$ and
are jointly independent (i.e. any finite subcollection of the $dX'_i$ are
jointly independent). This implies for any finite $n$ that the random
variables $(dX_i)_{i=1}^n$ and $(dX'_i)_{i=1}^n$ have the same distribution, and thus
$(X_t)_{t \in \Sigma'}$ and $(X'_t)_{t \in \Sigma'}$ have the same distribution for any finite subset
$\Sigma'$ of $\Sigma$. From the construction of the product $\sigma$-algebra we conclude
that $(X_t)_{t \in \Sigma}$ and $(X'_t)_{t \in \Sigma}$ have the same distribution, as required. □
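The existence half of this proof is constructive, and can be sketched in a few lines of code. The following is only an illustration under stated assumptions: the function name `discrete_wiener` and its parameters are hypothetical, and the standard library generator `random.Random` stands in for the abstract collection of independent gaussians.

```python
import math
import random


def discrete_wiener(times, mu=0.0, rng=None):
    """Sample X_t at the given times (strictly increasing, starting at 0).

    Follows the proof: X_{t_i} = mu + dX_1 + ... + dX_i, where the dX_i are
    independent N(0, t_i - t_{i-1}) variables.
    """
    if rng is None:
        rng = random.Random()
    if not times or times[0] != 0:
        raise ValueError("the set of times must contain 0 as its first element")
    path = [mu]
    for prev, cur in zip(times, times[1:]):
        if cur <= prev:
            raise ValueError("times must be strictly increasing")
        # random.gauss expects a standard deviation, hence the square root.
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(cur - prev)))
    return path


# Hypothetical set of times; the spacing need not be uniform.
times = [0, 0.5, 1.0, 2.5, 4.0]
path = discrete_wiener(times, mu=1.0, rng=random.Random(0))
```

Because each value is a partial sum of independent gaussian increments, properties (i), (iii) and (iv) of Definition 3.1.1 hold by construction, and (ii) is automatic for a discrete set of times.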



Now we pass from the discrete case to the continuous case.

Proposition 3.1.6 (Continuous Brownian motion). Let $\mu \in \mathbb{R}$.
Then (after extending the sample space if necessary) there exists a
Wiener process $(X_t)_{t \in [0,+\infty)}$ with base point $\mu$. Furthermore, any
other Wiener process $(X'_t)_{t \in [0,+\infty)}$ with base point $\mu$ has the same
distribution as $(X_t)_{t \in [0,+\infty)}$.



Proof. The uniqueness claim follows by the same argument used to 
prove the uniqueness component of Proposition 3.1.5, so we just prove 
existence here. The iterative construction we give here is somewhat 
analogous to that used to create self-similar fractals, such as the Koch 
snowflake. (Indeed, Brownian motion can be viewed as a probabilistic 
analogue of a self-similar fractal.) 

The idea is to create a sequence of increasingly fine discrete Brow- 
nian motions, and then to take a limit. Proposition 3.1.5 allows one 
to create each individual discrete Brownian motion, but the key is to 
couple these discrete processes together in a consistent manner. 

Here's how. We start with a discrete Wiener process $(X_t)_{t \in \mathbb{N}}$ on
the natural numbers $\mathbb{N} = \{0, 1, 2, \dots\}$ with initial position $\mu$, which
exists by Proposition 3.1.5. We now extend this process to the denser
set of times $\frac{1}{2}\mathbb{N} := \{\frac{1}{2} n : n \in \mathbb{N}\}$ by setting

$X_{t+\frac{1}{2}} := \frac{X_t + X_{t+1}}{2} + Y_{t,0}$

for $t = 0, 1, 2, \dots$, where $(Y_{t,0})_{t \in \mathbb{N}}$ are iid copies of $N(0, 1/4)_{\mathbb{R}}$, which
are jointly independent of the $(X_t)_{t \in \mathbb{N}}$. It is a routine matter to use
Exercise 2.1.9 to show that this creates a discrete Wiener process
$(X_t)_{t \in \frac{1}{2}\mathbb{N}}$ on $\frac{1}{2}\mathbb{N}$ which extends the previous process.

Next, we extend the process further to the denser set of times
$\frac{1}{4}\mathbb{N}$ by defining

$X_{t+\frac{1}{4}} := \frac{X_t + X_{t+1/2}}{2} + Y_{t,1}$

where $(Y_{t,1})_{t \in \frac{1}{2}\mathbb{N}}$ are iid copies of $N(0, 1/8)_{\mathbb{R}}$, jointly independent of
$(X_t)_{t \in \frac{1}{2}\mathbb{N}}$. Again, it is a routine matter to show that this creates a
discrete Wiener process $(X_t)_{t \in \frac{1}{4}\mathbb{N}}$ on $\frac{1}{4}\mathbb{N}$.

Iterating this procedure a countable number^1 of times, we obtain
a collection of discrete Wiener processes $(X_t)_{t \in \frac{1}{2^k}\mathbb{N}}$ for $k = 0, 1, 2, \dots$
which are consistent with each other, in the sense that the earlier
processes in this collection are restrictions of later ones.
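The midpoint refinement just described can be sketched in code. This is a minimal illustration, assuming a finite time horizon and a finite number of refinement levels (the proof itself iterates countably many times on all of $\mathbb{N}$); the names `refine` and `dyadic_wiener` are hypothetical.

```python
import math
import random


def refine(path, spacing, rng):
    """Given samples at times 0, spacing, 2*spacing, ..., insert midpoints.

    The new value between X_t and X_{t+spacing} is their average plus an
    independent N(0, spacing/4) variable (variance 1/4 for spacing 1,
    1/8 for spacing 1/2, and so on).
    """
    sigma = math.sqrt(spacing / 4.0)
    refined = []
    for left, right in zip(path, path[1:]):
        refined.append(left)
        refined.append(0.5 * (left + right) + rng.gauss(0.0, sigma))
    refined.append(path[-1])
    return refined


def dyadic_wiener(horizon, levels, mu=0.0, rng=None):
    """Return samples at times j / 2**levels for 0 <= j <= horizon * 2**levels."""
    if rng is None:
        rng = random.Random()
    # Level 0: a discrete Wiener process on {0, 1, ..., horizon}.
    path = [mu]
    for _ in range(horizon):
        path.append(path[-1] + rng.gauss(0.0, 1.0))
    spacing = 1.0
    for _ in range(levels):
        path = refine(path, spacing, rng)
        spacing /= 2.0
    return path
```

Taking every second entry of a refined path recovers the coarser path exactly, which is the consistency property used in the proof.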

Now we establish a Hölder continuity property. Let $\theta$ be any
exponent between $0$ and $1/2$, and let $T > 0$ be finite. Observe that
for any $k = 0, 1, \dots$ and any $j \in \mathbb{N}$, we have
$X_{(j+1)/2^k} - X_{j/2^k} \equiv N(0, 1/2^k)_{\mathbb{R}}$, and hence (by the subgaussian nature of the normal
distribution)

$\mathbf{P}(|X_{(j+1)/2^k} - X_{j/2^k}| \ge 2^{-k\theta}) \le C \exp(-c 2^{k(1-2\theta)})$

for some absolute constants $C, c$. The right-hand side is summable
as $j, k$ run over $\mathbb{N}$ subject to the constraint $j/2^k \le T$. Thus, by the
Borel-Cantelli lemma, for each fixed $T$, we almost surely have that

$|X_{(j+1)/2^k} - X_{j/2^k}| \le 2^{-k\theta}$

for all but finitely many $j, k \in \mathbb{N}$ with $j/2^k \le T$. In particular,
this implies that for each fixed $T$, the function $t \mapsto X_t$ is almost
surely Hölder continuous^2 of exponent $\theta$ on the dyadic rationals $j/2^k$
in $[0,T]$, and thus (by the countable union bound) is almost surely
locally Hölder continuous of exponent $\theta$ on the dyadic rationals in
$[0, +\infty)$. In particular, it is almost surely locally uniformly
continuous on this domain.

As the dyadic rationals are dense in $[0, +\infty)$, we can thus almost
surely^3 extend $t \mapsto X_t$ uniquely to a continuous function on all of
$[0, +\infty)$. Note that if $t_n$ is any sequence in $[0, +\infty)$ converging to
$t$, then $X_{t_n}$ converges almost surely to $X_t$, and thus also converges
in probability and in distribution. Similarly for differences such as
$X_{t_{+,n}} - X_{t_{-,n}}$. Using this, we easily verify that $(X_t)_{t \in [0,+\infty)}$ is a
continuous Wiener process, as required. □
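The summability claim used in the Borel-Cantelli step can be checked directly. As a sketch (the counting bound $T 2^k + 1$ is spelled out here and is not taken from the text): for each $k$ there are at most $T 2^k + 1$ indices $j \in \mathbb{N}$ with $j/2^k \le T$, so

```latex
\sum_{k=0}^{\infty} \; \sum_{j \in \mathbb{N} :\, j/2^k \le T}
    C \exp\bigl(-c\, 2^{k(1-2\theta)}\bigr)
  \;\le\; C \sum_{k=0}^{\infty} (T 2^k + 1) \exp\bigl(-c\, 2^{k(1-2\theta)}\bigr)
  \;<\; \infty .
```

The last sum converges because $1 - 2\theta > 0$: the exponential factor then decays faster than any power of $2^{-k}$, which more than compensates for the growth of $T 2^k + 1$. This is also where the restriction $\theta < 1/2$ enters.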



^1 This requires a countable number of extensions of the underlying sample space,
but one can capture all of these extensions into a single extension via the machinery of
inverse limits of probability spaces; it is also not difficult to manually build a single
extension sufficient for performing all the above constructions.

^2 In other words, there exists a constant $C_T$ such that $|X_s - X_t| \le C_T |s - t|^\theta$ for
all $s, t \in [0,T]$.

^3 On the remaining probability zero event, we extend $t \mapsto X_t$ in some arbitrary
measurable fashion.

Remark 3.1.7. One could also have used the Kolmogorov extension
theorem (see e.g. [Ta2011]) to establish the limit.

Exercise 3.1.1. Let $(X_t)_{t \in [0,+\infty)}$ be a continuous Wiener process.
We have already seen that if $0 < \theta < 1/2$, then the map $t \mapsto X_t$ is
almost surely Hölder continuous of order $\theta$. Show that if $1/2 < \theta \le 1$,
then the map $t \mapsto X_t$ is almost surely not Hölder continuous of order
$\theta$.

Show also that the map $t \mapsto X_t$ is almost surely nowhere
differentiable. Thus, Brownian motion provides a (probabilistic) example
of a continuous function which is nowhere differentiable.

Remark 3.1.8. In the above constructions, the initial position $\mu$
of the Wiener process was deterministic. However, one can easily
construct Wiener processes in which the initial position $X_0$ is itself a
random variable. Indeed, one can simply set

$X_t := X_0 + B_t$

where $(B_t)_{t \in [0,+\infty)}$ is a continuous Wiener process with initial
position $0$ which is independent of $X_0$. Then we see that $X_t$ obeys
properties (ii), (iii), (iv) of Definition 3.1.1, but the distribution of
$X_t$ is no longer $N(\mu, t)_{\mathbb{R}}$, but is instead the convolution of the law of
$X_0$ and the law of $N(0, t)_{\mathbb{R}}$.

3.1.2. Connection with random walks. We saw how to construct 
Brownian motion as a limit of discrete Wiener processes, which were 
partial sums of independent Gaussian random variables. The central 
limit theorem (see Section 2.2) allows one to interpret Brownian mo- 
tion in terms of limits of partial sums of more general independent 
random variables, otherwise known as (independent) random walks. 

Definition 3.1.9 (Random walk). Let $\Delta X$ be a real random variable,
let $\mu \in \mathbb{R}$ be an initial position, and let $\Delta t > 0$ be a time step. We
define a discrete random walk with initial position $\mu$, time step $\Delta t$ and
step distribution $\Delta X$ (or $\mu_{\Delta X}$) to be a process $(X_t)_{t \in \Delta t \cdot \mathbb{N}}$ defined by

$X_{n \Delta t} := \mu + \sum_{i=1}^n \Delta X_{i \Delta t}$

where $(\Delta X_{i \Delta t})_{i=1}^\infty$ are iid copies of $\Delta X$.

Example 3.1.10. From the proof of Proposition 3.1.5, we see that
a discrete Wiener process on $\Delta t \cdot \mathbb{N}$ with initial position $\mu$ is nothing
more than a discrete random walk with step distribution of $N(0, \Delta t)_{\mathbb{R}}$.
Another basic example is simple random walk, in which $\Delta X$ is equal
to $(\Delta t)^{1/2}$ times a signed Bernoulli variable, thus we have
$X_{(n+1)\Delta t} = X_{n \Delta t} \pm (\Delta t)^{1/2}$, where the signs $\pm$ are unbiased and are jointly
independent in $n$.
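As an illustration of Definition 3.1.9, the simple random walk of Example 3.1.10 can be sketched as follows. The function name `simple_random_walk` and its parameters are hypothetical, chosen only for this example.

```python
import math
import random


def simple_random_walk(n_steps, dt, mu=0.0, rng=None):
    """Return [X_0, X_dt, ..., X_{n_steps * dt}] for simple random walk.

    Each step is +sqrt(dt) or -sqrt(dt) with equal probability, so that a
    single step has mean zero and variance dt, like a Wiener increment.
    """
    if rng is None:
        rng = random.Random()
    step = math.sqrt(dt)
    path = [mu]
    for _ in range(n_steps):
        path.append(path[-1] + rng.choice((-step, step)))
    return path


# 100 steps of size dt = 0.01 bring the walk to time 1.
walk = simple_random_walk(100, 0.01, mu=2.0, rng=random.Random(4))
```

With this normalisation the position at time $1$ has variance $1$ regardless of $\Delta t$, which is the scaling under which the walk is compared with Brownian motion.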

Exercise 3.1.2 (Central limit theorem). Let $X$ be a real random
variable with mean zero and variance 1, and let $\mu \in \mathbb{R}$. For each
$\Delta t > 0$, let $(X^{(\Delta t)}_t)_{t \in [0,+\infty)}$ be a process formed by starting with
a random walk $(X^{(\Delta t)}_t)_{t \in \Delta t \cdot \mathbb{N}}$ with initial position $\mu$, time step $\Delta t$,
and step distribution $(\Delta t)^{1/2} X$, and then extending to other times in
$[0, +\infty)$ in a piecewise linear fashion, thus

$X^{(\Delta t)}_{(n+\theta)\Delta t} := (1 - \theta) X^{(\Delta t)}_{n \Delta t} + \theta X^{(\Delta t)}_{(n+1)\Delta t}$

for all $n \in \mathbb{N}$ and $0 < \theta < 1$. Show that as $\Delta t \to 0$, the process
$(X^{(\Delta t)}_t)_{t \in [0,+\infty)}$ converges in distribution to a continuous Wiener
process with initial position $\mu$. (Hint: from the Riesz representation
theorem (or the Kolmogorov extension theorem), it suffices to establish
this convergence for every finite set of times in $[0, +\infty)$. Now use
the central limit theorem, treating the piecewise linear modifications
to the process as an error term.)

3.1.3. Connection with the heat equation. Let $(B_t)_{t \in [0,+\infty)}$ be
a Wiener process with base point $\mu$, and let $F : \mathbb{R} \to \mathbb{R}$ be a smooth
function with all derivatives bounded. Then, for each time $t$, the random
variable $F(B_t)$ is bounded and thus has an expectation $\mathbf{E} F(B_t)$.
From the almost sure continuity of $B_t$ and the dominated convergence
theorem we see that the map $t \mapsto \mathbf{E} F(B_t)$ is continuous. In fact it is
differentiable, and obeys the following differential equation:

Lemma 3.1.11 (Equation of motion). For all times $t \ge 0$, we have

$\frac{d}{dt} \mathbf{E} F(B_t) = \frac{1}{2} \mathbf{E} F_{xx}(B_t)$

where $F_{xx}$ is the second derivative of $F$. In particular, $t \mapsto \mathbf{E} F(B_t)$ is
continuously differentiable (because the right-hand side is continuous).

Proof. We work from first principles. It suffices to show for fixed
$t \ge 0$ that

$\mathbf{E} F(B_{t+dt}) = \mathbf{E} F(B_t) + \frac{1}{2} dt\, \mathbf{E} F_{xx}(B_t) + o(dt)$

as $dt \to 0$. We shall establish this just for non-negative $dt$; the claim
for negative $dt$ (which only needs to be considered for $t > 0$) is similar
and is left as an exercise.

Write $dB_t := B_{t+dt} - B_t$. From Taylor expansion and the bounded
third derivative of $F$, we have

(3.1) $F(B_{t+dt}) = F(B_t) + F_x(B_t) dB_t + \frac{1}{2} F_{xx}(B_t) |dB_t|^2 + O(|dB_t|^3)$.

We take expectations. Since $dB_t \equiv N(0, dt)_{\mathbb{R}}$, we have
$\mathbf{E} |dB_t|^3 = O((dt)^{3/2})$, so in particular

$\mathbf{E} F(B_{t+dt}) = \mathbf{E} F(B_t) + \mathbf{E} F_x(B_t) dB_t + \frac{1}{2} \mathbf{E} F_{xx}(B_t) |dB_t|^2 + o(dt)$.

Now observe that $dB_t$ is independent of $B_t$, and has mean zero and
variance $dt$. The claim follows. □
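Lemma 3.1.11 can be sanity-checked numerically on a concrete test function. The sketch below is an illustration only: it assumes $F = \cos$ (smooth with all derivatives bounded), computes expectations against $N(\mu, t)$ by a midpoint-rule quadrature, and compares a central finite difference in $t$ with $\frac{1}{2} \mathbf{E} F_{xx}(B_t)$. The helper name `gaussian_expectation` and the values of `mu`, `t`, `dt` are hypothetical choices.

```python
import math


def gaussian_expectation(f, mean, var, half_width=10.0, steps=4000):
    """Approximate E f(Y) for Y ~ N(mean, var) by the midpoint rule.

    The integral is taken in the standardised variable z over
    [-half_width, half_width]; the neglected tails are negligible.
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        z = -half_width + (k + 0.5) * h
        total += f(mean + math.sqrt(var) * z) * math.exp(-z * z / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)


mu, t, dt = 0.3, 1.0, 1e-4
F = math.cos


def F_xx(x):
    # Second derivative of cos.
    return -math.cos(x)


# Left-hand side: central difference approximation to d/dt E F(B_t).
lhs = (gaussian_expectation(F, mu, t + dt)
       - gaussian_expectation(F, mu, t - dt)) / (2.0 * dt)
# Right-hand side: (1/2) E F_xx(B_t).
rhs = 0.5 * gaussian_expectation(F_xx, mu, t)
```

For this particular $F$ one has the closed form $\mathbf{E} \cos(B_t) = \cos(\mu) e^{-t/2}$, so both sides should be close to $-\frac{1}{2} \cos(\mu) e^{-t/2}$.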

Exercise 3.1.3. Complete the proof of the lemma by considering
negative values of $dt$. (Hint: one has to exercise caution because $dB_t$
is not independent of $B_t$ in this case. However, it will be independent
of $B_{t+dt}$. Also, use the fact that $\mathbf{E} F_x(B_t)$ and $\mathbf{E} F_{xx}(B_t)$ are
continuous in $t$. Alternatively, one can deduce the formula for the
left-derivative from that of the right-derivative via a careful
application of the fundamental theorem of calculus, paying close attention
to the hypotheses of that theorem.)

Remark 3.1.12. In the language of Ito calculus, we can write (3.1)
as

(3.2) $dF(B_t) = F_x(B_t) dB_t + \frac{1}{2} F_{xx}(B_t) dt$.

Here, $dF(B_t) := F(B_{t+dt}) - F(B_t)$, and $dt$ should either be thought
of as being infinitesimal, or being very small, though in the latter case
the equation (3.2) should not be viewed as being exact, but instead
only being true up to errors of mean $o(dt)$ and third moment $O(dt^3)$.
This is a special case of Ito's formula. It should be compared against
the chain rule

$dF(X_t) = F_x(X_t) dX_t$

when $t \mapsto X_t$ is a smooth process. The non-smooth nature of Brownian
motion causes the quadratic term in the Taylor expansion to be
non-negligible, which explains^4 the additional term in (3.2), although
the Hölder continuity of this motion is sufficient to still be able to
ignore terms that are of cubic order or higher.
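A worked instance may help here. Take $F(x) = x^2$; this choice is an illustration rather than part of the text, and strictly speaking it lies outside the hypotheses of Lemma 3.1.11 since $F_x$ is unbounded, but the algebra is exact in this case:

```latex
d(B_t^2) = B_{t+dt}^2 - B_t^2
         = 2 B_t \, dB_t + |dB_t|^2
         = 2 B_t \, dB_t + dt + \bigl( |dB_t|^2 - dt \bigr).
```

The final bracket has mean zero and is of the type discarded in (3.2), leaving $d(B_t^2) = 2 B_t\, dB_t + dt$, whereas the ordinary chain rule would give only $2 B_t\, dB_t$. Taking expectations and using the independence of $dB_t$ and $B_t$ gives $\frac{d}{dt} \mathbf{E} B_t^2 = 1$, hence $\mathbf{E} B_t^2 = \mu^2 + t$, in agreement with $B_t \equiv N(\mu, t)$.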

Let $\rho(t, x)\,dx$ be the probability density function of $B_t$; by inspection
of the normal distribution, this is a smooth function for $t > 0$,
but is a Dirac mass at $\mu$ at time $t = 0$. By definition of density
function,

$\mathbf{E} F(B_t) = \int_{\mathbb{R}} F(x) \rho(t, x)\,dx$

for any Schwartz function $F$. Applying Lemma 3.1.11 and integrating
by parts, we see that

(3.3) $\partial_t \rho = \frac{1}{2} \partial_{xx} \rho$

in the sense of (tempered) distributions (see e.g. [Ta2010, §1.13]).
In other words, $\rho$ is a (tempered distributional) solution to the heat
equation (3.3). Indeed, since $\rho$ is the Dirac mass at $\mu$ at time $t = 0$,
$\rho$ for later times $t$ is the fundamental solution of that equation from
initial position $\mu$.

From the theory of PDE one can solve^5 the (distributional) heat
equation with this initial data to obtain the unique solution

$\rho(t, x) = \frac{1}{\sqrt{2\pi t}} e^{-|x - \mu|^2 / 2t}$.

Of course, this is also the density function of $N(\mu, t)_{\mathbb{R}}$, which is
(unsurprisingly) consistent with the fact that $B_t \equiv N(\mu, t)$. Thus we see
why the normal distribution of the central limit theorem involves the
same type of functions (i.e. Gaussians) as the fundamental solution
of the heat equation. Indeed, one can use this argument to heuristically
derive the central limit theorem from the fundamental solution
of the heat equation (cf. Section 2.2.7), although the derivation is
only heuristic because one first needs to know that some limiting
distribution already exists (in the spirit of Exercise 3.1.2).

^4 In this spirit, one can summarise (the differential side of) Ito calculus informally
by the heuristic equations $dB_t = O((dt)^{1/2})$ and $|dB_t|^2 = dt$, with the understanding
that all terms that are $o(dt)$ are discarded.

^5 See for instance [Ta2010, §1.12].
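One can check numerically that the density of $N(\mu, t)$ satisfies the heat equation (3.3) for $t > 0$. The following sketch uses central finite differences; the function names `rho` and `heat_residual` and the step size `h` are assumptions made for this illustration.

```python
import math


def rho(t, x, mu=0.0):
    """Density of N(mu, t) at the point x, for t > 0."""
    return math.exp(-(x - mu) ** 2 / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)


def heat_residual(t, x, mu=0.0, h=1e-3):
    """Finite-difference estimate of d_t rho - (1/2) d_xx rho at (t, x)."""
    dt_rho = (rho(t + h, x, mu) - rho(t - h, x, mu)) / (2.0 * h)
    dxx_rho = (rho(t, x + h, mu) - 2.0 * rho(t, x, mu) + rho(t, x - h, mu)) / (h * h)
    return dt_rho - 0.5 * dxx_rho


residual = heat_residual(1.0, 0.7)
```

The residual is not exactly zero because of discretisation error of order $h^2$, but it should be small compared with the individual terms $\partial_t \rho$ and $\frac{1}{2} \partial_{xx} \rho$.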

Remark 3.1.13. Because we considered a Wiener process with a
deterministic initial position $\mu$, the density function $\rho$ was a Dirac mass
at time $t = 0$. However, one can run exactly the same arguments for
Wiener processes with stochastic initial position (see Remark 3.1.8),
and one will still obtain the same heat equation (3.3), but now with
a more general initial condition.

We have related one-dimensional Brownian motion to the
one-dimensional heat equation, but there is no difficulty establishing a
similar relationship in higher dimensions. In a vector space $\mathbb{R}^n$,
define a (continuous) Wiener process $(X_t)_{t \in [0,+\infty)}$ in $\mathbb{R}^n$ with an
initial position $\mu = (\mu_1, \dots, \mu_n) \in \mathbb{R}^n$ to be a process whose
components $(X_{t,i})_{t \in [0,+\infty)}$ for $i = 1, \dots, n$ are independent Wiener processes
with initial position $\mu_i$. It is easy to see that such processes exist,
are unique in distribution, and obey the same sort of properties as
in Definition 3.1.1, but with the one-dimensional Gaussian distribution
$N(\mu, \sigma^2)_{\mathbb{R}}$ replaced by the $n$-dimensional analogue $N(\mu, \sigma^2 I)_{\mathbb{R}^n}$,
which is given by the density function

$\frac{1}{(2\pi\sigma^2)^{n/2}} e^{-|x - \mu|^2 / 2\sigma^2}\,dx$

where $dx$ is now Lebesgue measure on $\mathbb{R}^n$.

Exercise 3.1.4. If $(B_t)_{t \in [0,+\infty)}$ is an $n$-dimensional continuous Wiener
process, show that

$\frac{d}{dt} \mathbf{E} F(B_t) = \frac{1}{2} \mathbf{E} (\Delta F)(B_t)$

whenever $F : \mathbb{R}^n \to \mathbb{R}$ is smooth with all derivatives bounded, where

$\Delta F := \sum_{i=1}^n \frac{\partial^2}{\partial x_i^2} F$

is the Laplacian of $F$. Conclude in particular that the density function
$\rho(t, x)\,dx$ of $B_t$ obeys the (distributional) heat equation

$\partial_t \rho = \frac{1}{2} \Delta \rho$.

A simple but fundamental observation is that $n$-dimensional Brownian
motion is rotation-invariant: more precisely, if $(X_t)_{t \in [0,+\infty)}$ is an
$n$-dimensional Wiener process with initial position $0$, and $U \in O(n)$
is any orthogonal transformation on $\mathbb{R}^n$, then $(U X_t)_{t \in [0,+\infty)}$ is
another Wiener process with initial position $0$, and thus has the same
distribution:

(3.4) $(U X_t)_{t \in [0,+\infty)} \equiv (X_t)_{t \in [0,+\infty)}$.

This is ultimately because the $n$-dimensional normal distributions
$N(0, \sigma^2 I)_{\mathbb{R}^n}$ are manifestly rotation-invariant (see Exercise 2.2.13).

Remark 3.1.14. One can also relate variable-coefficient heat equations
to variable-coefficient Brownian motion $(X_t)_{t \in [0,+\infty)}$, in which
the variance of an increment $dX_t$ is now only proportional to $dt$ for
infinitesimal $dt$ rather than being equal to $dt$, with the constant of
proportionality allowed to depend on the time $t$ and on the position
$X_t$. One can also add drift terms by allowing the increment $dX_t$ to
have a non-zero mean (which is also proportional to $dt$). This can be
accomplished through the machinery of stochastic calculus, which we 
will not discuss in detail in this text. In a similar fashion, one can 
construct Brownian motion (and heat equations) on manifolds or on 
domains with boundary, though we will not discuss this topic here. 

Exercise 3.1.5. Let $X$ be a real random variable of mean zero and
variance 1. Define a stochastic process $(X_t)_{t \in [0,+\infty)}$ by the formula

$X_t := e^{-t} (X + B_{e^{2t} - 1})$

where $(B_t)_{t \in [0,+\infty)}$ is a Wiener process with initial position zero that
is independent of $X$. This process is known as an Ornstein-Uhlenbeck
process.

• Show that each $X_t$ has mean zero and variance 1.

• Show that $X_t$ converges in distribution to $N(0, 1)_{\mathbb{R}}$ as $t \to \infty$.

• If $F : \mathbb{R} \to \mathbb{R}$ is smooth with all derivatives bounded, show
that

$\frac{d}{dt} \mathbf{E} F(X_t) = \mathbf{E} L F(X_t)$

where $L$ is the Ornstein-Uhlenbeck operator

$L F := F_{xx} - x F_x$.

Conclude that the density function $\rho(t, x)$ of $X_t$ obeys (in a
distributional sense, at least) the Ornstein-Uhlenbeck equation

$\partial_t \rho = L^* \rho$

where the adjoint operator $L^*$ is given by

$L^* \rho := \rho_{xx} + \partial_x (x \rho)$.

• Show that the only probability density function $\rho$ for which
$L^* \rho = 0$ is the Gaussian $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx$; thus the
Ornstein-Uhlenbeck process has $N(0, 1)_{\mathbb{R}}$ as its unique stationary distribution.
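The first claim of this exercise can be explored by simulation. In the sketch below, the initial variable $X$ is assumed (for illustration only) to be a random sign, which has mean zero and variance 1; the function name `ou_sample` is hypothetical. Since $B_s \equiv N(0, s)$, a single draw of $B_{e^{2t}-1}$ suffices for a fixed time $t$.

```python
import math
import random


def ou_sample(t, rng):
    """Draw one sample of X_t = e^{-t} (X + B_{e^{2t} - 1}).

    X is taken to be a random sign (an assumption of this sketch), and
    B_s is drawn directly from N(0, s), independently of X.
    """
    x0 = rng.choice((-1.0, 1.0))
    s = math.exp(2.0 * t) - 1.0
    b = rng.gauss(0.0, math.sqrt(s))
    return math.exp(-t) * (x0 + b)


rng = random.Random(5)
samples = [ou_sample(0.7, rng) for _ in range(50000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean) ** 2 for x in samples) / len(samples)
```

The exact computation behind the simulation is $\mathrm{Var}(X_t) = e^{-2t} (1 + (e^{2t} - 1)) = 1$, so `sample_var` should be close to 1 up to Monte Carlo error.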



Remark 3.1.15. The heat kernel $\frac{1}{(\sqrt{2\pi t})^d} e^{-|x - \mu|^2 / 2t}$ in $d$ dimensions
is absolutely integrable in time away from the initial time $t = 0$ for
dimensions $d \ge 3$, but becomes divergent in dimension 1 and (just
barely) divergent for $d = 2$. This causes the qualitative behaviour of
Brownian motion $B_t$ in $\mathbb{R}^d$ to be rather different in the two regimes.
For instance, in dimensions $d \ge 3$ Brownian motion is transient;
almost surely one has $B_t \to \infty$ as $t \to \infty$. But in dimension $d = 1$
Brownian motion is recurrent: for each $x_0 \in \mathbb{R}$, one almost surely
has $B_t = x_0$ for infinitely many $t$. In the critical dimension $d = 2$,
Brownian motion turns out to not be recurrent, but is instead
neighbourhood recurrent: almost surely, $B_t$ revisits every neighbourhood
of $x_0$ at arbitrarily large times, but does not visit $x_0$ itself for any
positive time $t$. The study of Brownian motion and its relatives is
in fact a huge and active area of study in modern probability theory,
but will not be discussed in this course.




3.1.4. Dyson Brownian motion. The space $V$ of $n \times n$ Hermitian
matrices can be viewed as a real vector space of dimension $n^2$ using
the Frobenius norm

$A \mapsto \mathrm{tr}(A^2)^{1/2} = \Bigl( \sum_{i=1}^n a_{ii}^2 + 2 \sum_{1 \le i < j \le n} \bigl( \mathrm{Re}(a_{ij})^2 + \mathrm{Im}(a_{ij})^2 \bigr) \Bigr)^{1/2}$

where $a_{ij}$ are the coefficients of $A$. One can then identify $V$ explicitly
with $\mathbb{R}^{n^2}$ via the identification

$(a_{ij})_{1 \le i,j \le n} \equiv ((a_{ii})_{i=1}^n, (\sqrt{2}\,\mathrm{Re}(a_{ij}), \sqrt{2}\,\mathrm{Im}(a_{ij}))_{1 \le i < j \le n})$.

Now that one has this identification, for each Hermitian matrix
$A_0 \in V$ (deterministic or stochastic) we can define a Wiener process
$(A_t)_{t \in [0,+\infty)}$ on $V$ with initial position $A_0$. By construction,
we see that $t \mapsto A_t$ is almost surely continuous, and each increment
$A_{t_+} - A_{t_-}$ is equal to $(t_+ - t_-)^{1/2}$ times a matrix drawn from the gaussian
unitary ensemble (GUE), with disjoint increments being jointly
independent. In particular, the diagonal entries of $A_{t_+} - A_{t_-}$ have
distribution $N(0, t_+ - t_-)_{\mathbb{R}}$, and the off-diagonal entries have
distribution $N(0, t_+ - t_-)_{\mathbb{C}}$.

Given any Hermitian matrix $A$, one can form the spectrum $(\lambda_1(A), \dots, \lambda_n(A))$,
which lies in the Weyl chamber
$\mathbb{R}^n_{\ge} := \{(\lambda_1, \dots, \lambda_n) \in \mathbb{R}^n : \lambda_1 \ge \dots \ge \lambda_n\}$.
Taking the spectrum of the Wiener process $(A_t)_{t \in [0,+\infty)}$,
we obtain a process

$(\lambda_1(A_t), \dots, \lambda_n(A_t))_{t \in [0,+\infty)}$

in the Weyl cone. We abbreviate $\lambda_i(A_t)$ as $\lambda_i$.

For $t > 0$, we see that $A_t$ is absolutely continuously distributed
in $V$. In particular, since almost every Hermitian matrix has simple
spectrum, we see that $A_t$ has almost surely simple spectrum for $t > 0$.
(The same is true for $t = 0$ if we assume that $A_0$ also has an absolutely
continuous distribution.)

The stochastic dynamics of this evolution can be described by
Dyson Brownian motion [Dy1962]:

Theorem 3.1.16 (Dyson Brownian motion). Let $t > 0$, and let $dt > 0$,
and let $\lambda_1, \dots, \lambda_n$ be as above. Then we have

(3.5) $d\lambda_i = dB_i + \sum_{1 \le j \le n : j \ne i} \frac{dt}{\lambda_i - \lambda_j} + \dots$

for all $1 \le i \le n$, where $d\lambda_i := \lambda_i(A_{t+dt}) - \lambda_i(A_t)$, and $dB_1, \dots, dB_n$
are iid copies of $N(0, dt)_{\mathbb{R}}$ which are jointly independent of $(A_{t'})_{t' \in [0,t]}$,
and the error term $\dots$ has mean $o(dt)$ and third moment $O(dt^3)$ in
the limit $dt \to 0$ (holding $t$ and $n$ fixed).

Using the language of Ito calculus, one usually views $dt$ as
infinitesimal and drops the $\dots$ error, thus giving the elegant formula

$d\lambda_i = dB_i + \sum_{1 \le j \le n : j \ne i} \frac{dt}{\lambda_i - \lambda_j}$

that shows that the eigenvalues $\lambda_i$ evolve by Brownian motion,
combined with a deterministic repulsion force that repels nearby
eigenvalues from each other with a strength inversely proportional to the
separation. One can extend the theorem to the $t = 0$ case by a limiting
argument provided that $A_0$ has an absolutely continuous distribution.
Note that the decay rate of the error $\dots$ can depend on $n$, so it is not
safe to let $n$ go off to infinity while holding $dt$ fixed. However, it is
safe to let $dt$ go to zero first, and then send $n$ off to infinity^6.
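To get a feel for (3.5), one can discretise it naively, dropping the error term and taking a small but finite time step. This Euler-type scheme is an assumption of the sketch below and not a method from the text; in particular it can let eigenvalues cross if `dt` is too large relative to the gaps. The names `dyson_drift` and `dyson_step` are hypothetical.

```python
import math
import random


def dyson_drift(lam):
    """Repulsion term of (3.5): sum over j != i of 1 / (lam_i - lam_j)."""
    n = len(lam)
    return [
        sum(1.0 / (lam[i] - lam[j]) for j in range(n) if j != i)
        for i in range(n)
    ]


def dyson_step(lam, dt, rng):
    """One naive Euler step: lam_i + dB_i + drift_i * dt, with dB_i ~ N(0, dt)."""
    drift = dyson_drift(lam)
    return [
        l + rng.gauss(0.0, math.sqrt(dt)) + d * dt
        for l, d in zip(lam, drift)
    ]


# Hypothetical initial spectrum, listed in decreasing order.
lam = [3.0, 1.0, -0.5, -2.0]
drift = dyson_drift(lam)
new_lam = dyson_step(lam, 1e-4, random.Random(6))
```

Two features of the repulsion are visible here: the largest eigenvalue is pushed up and the smallest is pushed down, and the drift terms sum to zero by antisymmetry, so the trace $\lambda_1 + \dots + \lambda_n$ is driven only by the Brownian terms.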

Proof. Fix $t$. We can write $A_{t+dt} = A_t + (dt)^{1/2} G$, where $G$ is
independent^7 of $A_t$ and has the GUE distribution. We now condition
$A_t$ to be fixed, and establish (3.5) for almost every fixed choice of $A_t$;
the general claim then follows upon undoing the conditioning (and
applying the dominated convergence theorem). Due to independence,
observe that $G$ continues to have the GUE distribution even after
conditioning $A_t$ to be fixed.

Almost surely, $A_t$ has simple spectrum; so we may assume that
the fixed choice of $A_t$ has simple spectrum also. The eigenvalues $\lambda_i$
now vary smoothly near $t$, so we may Taylor expand

$\lambda_i(A_{t+dt}) = \lambda_i + (dt)^{1/2} \nabla_G \lambda_i + \frac{1}{2} dt\, \nabla_G^2 \lambda_i + O((dt)^{3/2} \|G\|^3)$

for sufficiently small $dt$, where $\nabla_G$ is directional differentiation in
the $G$ direction, and the implied constants in the $O()$ notation can
depend on $A_t$ and $n$. In particular, we do not care what norm is used
to measure $G$ in.

As $G$ has the GUE distribution, the expectation and variance
of $\|G\|^3$ are bounded (possibly with constant depending on $n$), so the
error here has mean $o(dt)$ and third moment $O(dt^3)$. We thus have

$d\lambda_i = (dt)^{1/2} \nabla_G \lambda_i + \frac{1}{2} dt\, \nabla_G^2 \lambda_i + \dots$.

Next, from the first and second Hadamard variation formulae (1.73),
(1.74) we have

$\nabla_G \lambda_i = u_i^* G u_i$

and

$\nabla_G^2 \lambda_i = 2 \sum_{j \ne i} \frac{|u_j^* G u_i|^2}{\lambda_i - \lambda_j}$

where $u_1, \dots, u_n$ are an orthonormal eigenbasis for $A_t$, and thus

$d\lambda_i = (dt)^{1/2} u_i^* G u_i + dt \sum_{j \ne i} \frac{|u_j^* G u_i|^2}{\lambda_i - \lambda_j} + \dots$.

Now we take advantage of the unitary invariance of the Gaussian
unitary ensemble (that is, that $U G U^* \equiv G$ for all unitary matrices $U$;
this is easiest to see by noting that the probability density function
of $G$ is proportional to $\exp(-\|G\|_F^2 / 2)$). From this invariance, we
can assume without loss of generality that $u_1, \dots, u_n$ is the standard
orthonormal basis of $\mathbb{C}^n$, so that we now have

$d\lambda_i = (dt)^{1/2} \xi_{ii} + dt \sum_{j \ne i} \frac{|\xi_{ij}|^2}{\lambda_i - \lambda_j} + \dots$

where $\xi_{ij}$ are the coefficients of $G$. But the $\xi_{ii}$ are iid copies of
$N(0, 1)_{\mathbb{R}}$, and the $\xi_{ij}$ are iid copies of $N(0, 1)_{\mathbb{C}}$, and the claim
follows (note that $dt \sum_{j \ne i} \frac{|\xi_{ij}|^2 - 1}{\lambda_i - \lambda_j}$ has mean zero and third moment
$O(dt^3)$.) □

^6 It is also possible, by being more explicit with the error terms, to work with $dt$
being a specific negative power of $n$; see [TaVu2009b].

^7 Strictly speaking, $G$ depends on $dt$, but this dependence will not concern us.

Remark 3.1.17. Interestingly, one can interpret Dyson Brownian
motion in a different way, namely as the motion of $n$ independent
Wiener processes $\lambda_i(t)$ after one conditions the $\lambda_i$ to be non-intersecting
for all time; see [Gr1999]. It is intuitively reasonable that this
conditioning would cause a repulsion effect, though we do not know of
a simple heuristic reason why this conditioning should end up giving
the specific repulsion force present in (3.5).

In the previous section, we saw how a Wiener process led to
a PDE (the heat flow equation) that could be used to derive the
probability density function for each component $X_t$ of that process.
We can do the same thing here:

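Exercise 3.1.6. Let $\lambda_1, \dots, \lambda_n$ be as above. Let $F : \mathbb{R}^n \to \mathbb{R}$ be a
smooth function with bounded derivatives. Show that for any $t > 0$,
one has

$\partial_t \mathbf{E} F(\lambda_1, \dots, \lambda_n) = \mathbf{E} D^* F(\lambda_1, \dots, \lambda_n)$

where $D^*$ is the adjoint Dyson operator

$D^* F := \frac{1}{2} \sum_{i=1}^n \partial_{\lambda_i}^2 F + \sum_{1 \le i,j \le n : i \ne j} \frac{\partial_{\lambda_i} F}{\lambda_i - \lambda_j}$.

If we let $\rho : [0, +\infty) \times \mathbb{R}^n_{\ge} \to \mathbb{R}$ denote the density function
$\rho(t, \cdot) : \mathbb{R}^n_{\ge} \to \mathbb{R}$ of $(\lambda_1(t), \dots, \lambda_n(t))$ at time $t \in [0, +\infty)$, deduce the Dyson
partial differential equation

(3.6) $\partial_t \rho = D \rho$

(in the sense of distributions, at least, and on the interior of $\mathbb{R}^n_{\ge}$),
where $D$ is the Dyson operator

(3.7) $D \rho := \frac{1}{2} \sum_{i=1}^n \partial_{\lambda_i}^2 \rho - \sum_{1 \le i,j \le n : i \ne j} \partial_{\lambda_i} \Bigl( \frac{\rho}{\lambda_i - \lambda_j} \Bigr)$.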

The Dyson partial differential equation (3.6) looks a bit complicated,
but it can be simplified (formally, at least) by introducing the
Vandermonde determinant

(3.8) $\Delta_n(\lambda_1, \dots, \lambda_n) := \prod_{1 \le i < j \le n} (\lambda_i - \lambda_j)$.

Exercise 3.1.7. Show that (3.8) is the determinant of the matrix
$(\lambda_i^{n-j})_{1 \le i,j \le n}$, and is also the sum
$\sum_{\sigma \in S_n} \mathrm{sgn}(\sigma) \prod_{i=1}^n \lambda_{\sigma(i)}^{n-i}$.

Note that this determinant is non-zero on the interior of the Weyl
chamber $\mathbb{R}^n_{\ge}$. The significance of this determinant for us lies in the
identity

(3.9) $\partial_{\lambda_i} \Delta_n = \sum_{1 \le j \le n : j \ne i} \frac{\Delta_n}{\lambda_i - \lambda_j}$

which can be used to cancel off the second term in (3.7). Indeed, we
have

Exercise 3.1.8. Let $\rho$ be a smooth solution to (3.6) in the interior
of $\mathbb{R}^n_{\ge}$, and write

(3.10) $\rho = \Delta_n u$

in this interior. Show that $u$ obeys the linear heat equation

$\partial_t u = \frac{1}{2} \sum_{i=1}^n \partial_{\lambda_i}^2 u$

in the interior of $\mathbb{R}^n_{\ge}$. (Hint: You may need to exploit the identity

$\frac{1}{(a-b)(a-c)} + \frac{1}{(b-a)(b-c)} + \frac{1}{(c-a)(c-b)} = 0$

for distinct $a, b, c$. Equivalently,
you may need to first establish that the Vandermonde determinant is
a harmonic function.)
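The identity (3.9), the three-term identity in the hint, and the harmonicity of the Vandermonde determinant can all be spot-checked numerically before attempting a proof. The sketch below uses central finite differences at one sample point; all function names and the sample point `lam` are hypothetical choices for this illustration.

```python
def vandermonde(lam):
    """Product over i < j of (lam_i - lam_j), as in (3.8)."""
    prod = 1.0
    n = len(lam)
    for i in range(n):
        for j in range(i + 1, n):
            prod *= lam[i] - lam[j]
    return prod


def reciprocal_gap_sum(lam, i):
    """Sum over j != i of 1 / (lam_i - lam_j), the factor appearing in (3.9)."""
    return sum(1.0 / (lam[i] - lam[j]) for j in range(len(lam)) if j != i)


def partial(f, lam, i, h=1e-5):
    """Central difference approximation to the i-th partial derivative of f."""
    up, dn = list(lam), list(lam)
    up[i] += h
    dn[i] -= h
    return (f(up) - f(dn)) / (2.0 * h)


def laplacian(f, lam, h=1e-3):
    """Central difference approximation to the Laplacian of f at lam."""
    total = 0.0
    for i in range(len(lam)):
        up, dn = list(lam), list(lam)
        up[i] += h
        dn[i] -= h
        total += (f(up) - 2.0 * f(lam) + f(dn)) / (h * h)
    return total


def three_term_identity(a, b, c):
    """Left-hand side of the identity in the hint to Exercise 3.1.8."""
    return (1.0 / ((a - b) * (a - c))
            + 1.0 / ((b - a) * (b - c))
            + 1.0 / ((c - a) * (c - b)))


lam = [2.0, 0.5, -1.0, -1.7]
lap = laplacian(vandermonde, lam)
```

Here `lap` should vanish up to rounding error, and `partial(vandermonde, lam, i)` should agree with `vandermonde(lam) * reciprocal_gap_sum(lam, i)` for each `i`. A numerical check at one point is of course no substitute for the proofs requested in the exercises.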

Let $\rho$ be the density function of the $(\lambda_1, \dots, \lambda_n)$, as in Exercise 3.1.6.
Recall that the Wiener random matrix $A_t$ has a smooth distribution
in the space $V$ of Hermitian matrices, while the space of matrices in $V$
with non-simple spectrum has codimension 3 by Exercise 1.3.10. On
the other hand, the non-simple spectrum only has codimension 1 in
the Weyl chamber (being the boundary of this cone). Because of this,
we see that $\rho$ vanishes to at least second order on the boundary of
this cone (with correspondingly higher vanishing on higher codimension
facets of this boundary). Thus, the function $u$ in Exercise 3.1.8
vanishes to first order on this boundary (again with correspondingly
higher vanishing on higher codimension facets). Thus, if we extend $\rho$
symmetrically across the cone to all of $\mathbb{R}^n$, and extend the function
$u$ antisymmetrically, then the equation (3.6) and the factorisation
(3.10) extend (in the distributional sense) to all of $\mathbb{R}^n$. Extending
Exercise 3.1.8 to this domain (and being somewhat careful with various
issues involving distributions), we now see that $u$ obeys the linear heat
equation on all of $\mathbb{R}^n$.

Now suppose that the initial matrix $A_0$ had a deterministic spectrum $\nu = (\nu_1,\dots,\nu_n)$, which to avoid technicalities we will assume to be in the interior of the Weyl chamber (the boundary case then being obtainable by a limiting argument). Then $\rho$ is initially the Dirac delta function at $\nu$, extended symmetrically. Hence, $u$ is initially $\frac{1}{\Delta_n(\nu)}$ times the Dirac delta function at $\nu$, extended antisymmetrically:

$u(0,\lambda) = \frac{1}{\Delta_n(\nu)} \sum_{\sigma \in S_n} \mathrm{sgn}(\sigma) \delta_{\lambda - \sigma(\nu)}.$

Using the fundamental solution for the heat equation in $n$ dimensions, we conclude that

$u(t,\lambda) = \frac{1}{(2\pi t)^{n/2} \Delta_n(\nu)} \sum_{\sigma \in S_n} \mathrm{sgn}(\sigma) e^{-|\lambda - \sigma(\nu)|^2/2t}.$

By the Leibniz formula for determinants

$\det((a_{ij})_{1 \leq i,j \leq n}) = \sum_{\sigma \in S_n} \mathrm{sgn}(\sigma) \prod_{i=1}^n a_{i\sigma(i)},$

we can express the sum here as a determinant of the matrix

$(e^{-(\lambda_i - \nu_j)^2/2t})_{1 \leq i,j \leq n}.$



Applying (3.10), we conclude

Theorem 3.1.18 (Johansson formula). Let $A_0$ be a Hermitian matrix with simple spectrum $\nu = (\nu_1,\dots,\nu_n)$, let $t > 0$, and let $A_t = A_0 + t^{1/2} G$ where $G$ is drawn from GUE. Then the spectrum $\lambda = (\lambda_1,\dots,\lambda_n)$ of $A_t$ has probability density function

(3.11) $\rho(t,\lambda) = \frac{1}{(2\pi t)^{n/2}} \frac{\Delta_n(\lambda)}{\Delta_n(\nu)} \det(e^{-(\lambda_i - \nu_j)^2/2t})_{1 \leq i,j \leq n}$

on $\mathbb{R}^n_{\geq}$.

This formula is given explicitly in [Jo2001], who cites [BrHi1996] as inspiration. (One can also check by hand that (3.11) satisfies the Dyson equation (3.6).)

We will be particularly interested in the case when $A_0 = 0$ and $t = 1$, so that we are studying the probability density function of the eigenvalues $(\lambda_1(G),\dots,\lambda_n(G))$ of a GUE matrix $G$. The Johansson formula does not directly apply here, because $\nu$ is vanishing. However, we can investigate the limit of (3.11) in the limit as $\nu \to 0$ inside the Weyl chamber; the Lipschitz nature of the eigenvalue operations $A \mapsto \lambda_i(A)$ (from the Weyl inequalities) tells us that if (3.11) converges locally uniformly as $\nu \to 0$ for $\lambda$ in the interior of $\mathbb{R}^n_{\geq}$, then the limit will indeed^8 be the probability density function for $\nu = 0$.

Exercise 3.1.9. Show that as $\nu \to 0$, we have the identities

$\det(e^{-(\lambda_i - \nu_j)^2/2})_{1 \leq i,j \leq n} = e^{-|\lambda|^2/2} e^{-|\nu|^2/2} \det(e^{\lambda_i \nu_j})_{1 \leq i,j \leq n}$

and

$\det(e^{\lambda_i \nu_j})_{1 \leq i,j \leq n} = \frac{1}{1! \cdots (n-1)!} \Delta_n(\lambda) \Delta_n(\nu) + o(\Delta_n(\nu))$

locally uniformly in $\lambda$. (Hint: for the second identity, use Taylor expansion and the Leibniz formula for determinants, noting the left-hand side vanishes whenever $\Delta_n(\nu)$ vanishes and so can be treated by the (smooth) factor theorem.)

From the above exercise, we conclude the fundamental Ginibre formula [Gi1965]

(3.12) $\rho(\lambda) = \frac{1}{(2\pi)^{n/2}\, 1! \cdots (n-1)!} e^{-|\lambda|^2/2} |\Delta_n(\lambda)|^2$

for the density function for the spectrum $(\lambda_1(G),\dots,\lambda_n(G))$ of a GUE matrix $G$.

This formula can be derived by a variety of other means; we 
sketch one such way below. 

Exercise 3.1.10. For this exercise, assume that it is known that (3.12) is indeed a probability distribution on the Weyl chamber $\mathbb{R}^n_{\geq}$ (if not, one would have to replace the normalising constant in (3.12) by an unspecified normalisation factor depending only on $n$). Let $D = \mathrm{diag}(\lambda_1,\dots,\lambda_n)$ be drawn at random using the distribution (3.12), and let $U$ be drawn at random from Haar measure on $U(n)$. Show that the probability density function of $UDU^*$ at a matrix $A$ with simple spectrum is equal to $c_n e^{-\|A\|_F^2/2}$ for some constant $c_n > 0$. (Hint: use unitary invariance to reduce to the case when $A$ is diagonal. Now take a small $\varepsilon$ and consider what $U$ and $D$ must be in order for $UDU^*$ to lie within $\varepsilon$ of $A$ in the Frobenius norm, performing first order calculations only (i.e. linearising and ignoring all terms of order $o(\varepsilon)$).) Conclude that (3.12) must be the probability density function of the spectrum of a GUE matrix.

^8 Note from continuity that the density function cannot assign any mass to the boundary of the Weyl chamber, and in fact must vanish to at least second order by the previous discussion.

Exercise 3.1.11. Verify by hand that the self-similar extension

$\rho(t,x) := t^{-n^2/2} \rho(x/\sqrt{t})$

of the function (3.12) obeys the Dyson PDE (3.6). Why is this consistent with (3.12) being the density function for the spectrum of GUE?

Remark 3.1.19. Similar explicit formulae exist for other invariant ensembles, such as the gaussian orthogonal ensemble GOE and the gaussian symplectic ensemble GSE. One can also replace the exponent in density functions such as $e^{-\|A\|_F^2/2}$ with more general expressions than quadratic expressions of $A$. We will however not detail these formulae in this text (with the exception of the spectral distribution law for random iid Gaussian matrices, which we discuss in Section 2.6).

3.2. The Golden-Thompson inequality

Let $A, B$ be two Hermitian $n \times n$ matrices. When $A$ and $B$ commute, we have the identity

$e^{A+B} = e^A e^B.$

When $A$ and $B$ do not commute, the situation is more complicated; we have the Baker-Campbell-Hausdorff formula

$e^{A+B} = e^A e^B e^{-\frac{1}{2}[A,B]} \dots$

where the infinite product here is explicit but very messy. On the other hand, taking determinants we still have the identity

$\det(e^{A+B}) = \det(e^A e^B).$

An identity in a somewhat similar spirit (which Percy Deift has half-jokingly termed "the most important identity in mathematics") is the formula

(3.13) $\det(1 + AB) = \det(1 + BA)$

whenever $A, B$ are $n \times k$ and $k \times n$ matrices respectively (or more generally, $A$ and $B$ could be linear operators with sufficiently good spectral properties that make both sides equal). Note that the left-hand side is an $n \times n$ determinant, while the right-hand side is a $k \times k$ determinant; this formula is particularly useful when computing determinants of large matrices (or of operators), as one can often use it to transform such determinants into much smaller determinants. In particular, the asymptotic behaviour of $n \times n$ determinants as $n \to \infty$ can be converted via this formula to determinants of a fixed size (independent of $n$), which is often a more favourable situation to analyse. Unsurprisingly, this trick is particularly useful for understanding the asymptotic behaviour of determinantal processes.

There are many ways to prove (3.13). One is to observe first 
that when A, B are invertible square matrices of the same size, that 
1 + BA and 1 + AB are conjugate to each other and thus clearly have 
the same determinant; a density argument then removes the invert- 
ibility hypothesis, and a padding-by-zeroes argument then extends 
the square case to the rectangular case. Another is to proceed via the 
spectral theorem, noting that AB and BA have the same non-zero 
eigenvalues. 

By rescaling, one obtains the variant identity

$\det(z + AB) = z^{n-k} \det(z + BA)$

which essentially relates the characteristic polynomial of $AB$ with that of $BA$. When $n = k$, a comparison of coefficients already gives important basic identities such as $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ and $\det(AB) = \det(BA)$; when $n$ is larger than $k$, an inspection of the $z^{n-k}$ coefficient similarly gives the Cauchy-Binet formula

(3.14) $\det(BA) = \sum_{S \in \binom{[n]}{k}} \det(A_{S \times [k]}) \det(B_{[k] \times S})$

where $S$ ranges over all $k$-element subsets of $[n] := \{1,\dots,n\}$, $A_{S \times [k]}$ is the $k \times k$ minor of $A$ coming from the rows $S$, and $B_{[k] \times S}$ is similarly the $k \times k$ minor coming from the columns $S$. Unsurprisingly, the Cauchy-Binet formula is also quite useful when performing computations on determinantal processes.
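
To see the two determinant identities above in action, here is a small exact-arithmetic sketch. It checks (3.13) and the Cauchy-Binet formula (3.14) for one arbitrary choice of a $4 \times 2$ matrix $A$ and a $2 \times 4$ matrix $B$; the helper names and the sample entries are ours and carry no special meaning.

```python
from fractions import Fraction
from itertools import combinations

def det(m):
    """Determinant by Gaussian elimination over exact rationals."""
    a = [[Fraction(x) for x in row] for row in m]
    n, d = len(a), Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if a[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def matmul(x, y):
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]

def plus_identity(m):
    return [[m[i][j] + (1 if i == j else 0) for j in range(len(m))] for i in range(len(m))]

# A is n x k and B is k x n with n = 4, k = 2 (arbitrary illustrative entries).
A = [[1, 2], [0, -1], [3, 1], [2, 2]]
B = [[1, 0, 2, -1], [4, 1, 0, 3]]
n, k = 4, 2

lhs = det(plus_identity(matmul(A, B)))  # a 4 x 4 determinant
rhs = det(plus_identity(matmul(B, A)))  # a 2 x 2 determinant
assert lhs == rhs

# Cauchy-Binet: sum over k-element subsets S of the rows of A / columns of B.
cb = sum(
    det([A[i] for i in S]) * det([[B[r][j] for j in S] for r in range(k)])
    for S in combinations(range(n), k)
)
assert cb == det(matmul(B, A))
```

The point of the example is the size reduction: the left-hand side of (3.13) needs a $4 \times 4$ elimination, while the right-hand side only involves a $2 \times 2$ matrix.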
There is another very nice relationship between $e^{A+B}$ and $e^A e^B$, namely the Golden-Thompson inequality [Go1965, Th1965]

(3.15) $\mathrm{tr}(e^{A+B}) \leq \mathrm{tr}(e^A e^B).$

The remarkable thing about this inequality is that no commutativity hypotheses whatsoever on the matrices $A, B$ are required. Note that the right-hand side can be rearranged using the cyclic property of trace as $\mathrm{tr}(e^{B/2} e^A e^{B/2})$; the expression inside the trace is positive definite so the right-hand side is positive^9.

To get a sense of how delicate the Golden-Thompson inequality is, let us expand both sides to fourth order in $A, B$. The left-hand side expands as

$\mathrm{tr}\, 1 + \mathrm{tr}(A+B) + \frac{1}{2} \mathrm{tr}(A^2 + AB + BA + B^2) + \frac{1}{6} \mathrm{tr}(A+B)^3 + \frac{1}{24} \mathrm{tr}(A+B)^4 + \dots$

while the right-hand side expands as

$\mathrm{tr}\, 1 + \mathrm{tr}(A+B) + \frac{1}{2} \mathrm{tr}(A^2 + 2AB + B^2) + \frac{1}{6} \mathrm{tr}(A^3 + 3A^2B + 3AB^2 + B^3) + \frac{1}{24} \mathrm{tr}(A^4 + 4A^3B + 6A^2B^2 + 4AB^3 + B^4) + \dots$

Using the cyclic property of trace $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, one can verify that all terms up to third order agree. Turning to the fourth order terms, one sees after expanding out $(A+B)^4$ and using the cyclic property of trace as much as possible that the fourth order terms almost agree, but the left-hand side contains a term $\frac{1}{12} \mathrm{tr}(ABAB)$ whose counterpart on the right-hand side is $\frac{1}{12} \mathrm{tr}(ABBA)$. The difference between the two can be factorised (again using the cyclic property of trace) as $-\frac{1}{24} \mathrm{tr}[A,B]^2$. Since $[A,B] := AB - BA$ is skew-Hermitian, $-[A,B]^2$ is positive semi-definite, and so we have proven the Golden-Thompson inequality to fourth order^{10}.
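
The inequality (3.15) and the fourth-order gap $-\frac{1}{24}\mathrm{tr}[A,B]^2$ are easy to probe numerically. The following is a minimal floating-point sketch for one pair of non-commuting real symmetric $2 \times 2$ matrices; the matrix exponential is a truncated Taylor series (adequate only for matrices of small norm), and the specific matrices are arbitrary choices of ours.

```python
def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def add(x, y):
    return [[x[i][j] + y[i][j] for j in range(len(x))] for i in range(len(x))]

def trace(x):
    return sum(x[i][i] for i in range(len(x)))

def expm(a, terms=60):
    """Matrix exponential by truncated Taylor series (fine for small norms)."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for m in range(1, terms):
        # term now holds a^m / m!
        term = [[v / m for v in row] for row in matmul(term, a)]
        result = add(result, term)
    return result

A = [[1.0, 0.5], [0.5, -0.3]]
B = [[0.2, -0.7], [-0.7, 0.4]]

lhs = trace(expm(add(A, B)))             # tr e^{A+B}
rhs = trace(matmul(expm(A), expm(B)))    # tr e^A e^B
assert lhs <= rhs

# Commutator [A, B] and the fourth-order gap -tr([A,B]^2)/24, which is >= 0.
C = add(matmul(A, B), [[-v for v in row] for row in matmul(B, A)])
gap_4th_order = -trace(matmul(C, C)) / 24
assert gap_4th_order >= 0
```

For these matrices the commutator is non-zero, so the inequality is strict; the fourth-order term only accounts for part of the difference `rhs - lhs`, the rest coming from higher orders.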

^9 In contrast, the obvious extension of the Golden-Thompson inequality to three or more Hermitian matrices fails dramatically; there is no reason why expressions such as $\mathrm{tr}(e^A e^B e^C)$ need to be positive or even real.

^{10} One could also have used the Cauchy-Schwarz inequality for the Frobenius norm to establish this; see below.

Intuitively, the Golden-Thompson inequality is asserting that interactions between a pair $A, B$ of non-commuting Hermitian matrices are strongest when cross-interactions are kept to a minimum, so that all the $A$ factors lie on one side of a product and all the $B$ factors lie on the other. Indeed, this theme will be running through the proof of this inequality, to which we now turn.

The proof of the Golden-Thompson inequality relies on the somewhat magical power of the tensor power trick (see [Ta2008, §1.9]). For any even integer $p = 2, 4, 6, \dots$ and any $n \times n$ matrix $A$ (not necessarily Hermitian), we define the $p$-Schatten norm $\|A\|_p$ of $A$ by the formula^{11}

$\|A\|_p := (\mathrm{tr}(AA^*)^{p/2})^{1/p}.$

This norm can be viewed as a non-commutative analogue of the $\ell^p$ norm; indeed, the $p$-Schatten norm of a diagonal matrix is just the $\ell^p$ norm of the coefficients.

Note that the 2-Schatten norm

$\|A\|_2 := (\mathrm{tr}(AA^*))^{1/2}$

is the Hilbert space norm associated to the Frobenius inner product (or Hilbert-Schmidt inner product)

$\langle A, B \rangle := \mathrm{tr}(AB^*).$

This is clearly a non-negative Hermitian inner product, so by the Cauchy-Schwarz inequality we conclude that

$|\mathrm{tr}(A_1 A_2^*)| \leq \|A_1\|_2 \|A_2\|_2$

for any $n \times n$ matrices $A_1, A_2$. As $\|A_2\|_2 = \|A_2^*\|_2$, we conclude in particular that

$|\mathrm{tr}(A_1 A_2)| \leq \|A_1\|_2 \|A_2\|_2.$

We can iterate this and establish the non-commutative Hölder inequality

(3.16) $|\mathrm{tr}(A_1 A_2 \dots A_p)| \leq \|A_1\|_p \|A_2\|_p \dots \|A_p\|_p$

whenever $p = 2, 4, 8, \dots$ is an even power of 2 (compare with Exercise 1.3.9). Indeed, we induct on $p$, the case $p = 2$ already having been established. If $p \geq 4$ is a power of 2, then by the induction hypothesis (grouping $A_1 \dots A_p$ into $p/2$ pairs) we can bound

(3.17) $|\mathrm{tr}(A_1 A_2 \dots A_p)| \leq \|A_1 A_2\|_{p/2} \|A_3 A_4\|_{p/2} \dots \|A_{p-1} A_p\|_{p/2}.$

On the other hand, we may expand

$\|A_1 A_2\|_{p/2}^{p/2} = \mathrm{tr}\, A_1 A_2 A_2^* A_1^* \dots A_1 A_2 A_2^* A_1^*.$

We use the cyclic property of trace to move the rightmost $A_1^*$ factor to the left. Applying the induction hypothesis again, we conclude that

$\|A_1 A_2\|_{p/2}^{p/2} \leq \|A_1^* A_1\|_{p/2} \|A_2 A_2^*\|_{p/2} \dots \|A_1^* A_1\|_{p/2} \|A_2 A_2^*\|_{p/2}.$

But from the cyclic property of trace again, we have $\|A_1^* A_1\|_{p/2} = \|A_1\|_p^2$ and $\|A_2 A_2^*\|_{p/2} = \|A_2\|_p^2$. We conclude that

$\|A_1 A_2\|_{p/2} \leq \|A_1\|_p \|A_2\|_p$

and similarly for $\|A_3 A_4\|_{p/2}$, etc. Inserting this into (3.17) we obtain (3.16).

^{11} This formula in fact defines a norm for any $p \geq 1$; see Exercise 1.3.22(vi). However, we will only need the even integer case here.

Remark 3.2.1. Though we will not need to do so here, it is interesting to note that one can use the tensor power trick to amplify (3.16) for $p$ equal to a power of two, to obtain (3.16) for all positive integers $p$, at least when the $A_i$ are all Hermitian (again, compare with Exercise 1.3.9). Indeed, pick a large integer $m$ and let $N$ be the integer part of $2^m/p$. Then expand the left-hand side of (3.16) as $\mathrm{tr}(A_1^{1/N} \dots A_1^{1/N} A_2^{1/N} \dots A_p^{1/N} \dots A_p^{1/N})$ and apply (3.16) with $p$ replaced by $2^m$ to bound this by $\|A_1^{1/N}\|_{2^m}^N \dots \|A_p^{1/N}\|_{2^m}^N \|1\|_{2^m}^{2^m - pN}$. Sending $m \to \infty$ (noting that $2^m = (1 + o(1))Np$) we obtain the claim.

Specialising (3.16) to the case where $A_1 = \dots = A_p = AB$ for some Hermitian matrices $A, B$, we conclude that

$\mathrm{tr}((AB)^p) \leq \|AB\|_p^p$

and hence by cyclic permutation

$\mathrm{tr}((AB)^p) \leq \mathrm{tr}((A^2B^2)^{p/2})$

for any $p = 2, 4, 8, \dots$. Iterating this we conclude that

(3.18) $\mathrm{tr}((AB)^p) \leq \mathrm{tr}(A^pB^p).$

Applying this with $A, B$ replaced by $e^{A/p}$ and $e^{B/p}$ respectively, we obtain

$\mathrm{tr}((e^{A/p}e^{B/p})^p) \leq \mathrm{tr}(e^Ae^B).$

Now we send $p \to \infty$. Since $e^{A/p} = 1 + A/p + O(1/p^2)$ and $e^{B/p} = 1 + B/p + O(1/p^2)$, we have $e^{A/p}e^{B/p} = e^{(A+B)/p + O(1/p^2)}$, and so the left-hand side is $\mathrm{tr}(e^{A+B+O(1/p)})$; taking the limit as $p \to \infty$ we obtain the Golden-Thompson inequality^{12}.
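
The key intermediate inequality (3.18), $\mathrm{tr}((AB)^p) \leq \mathrm{tr}(A^pB^p)$ for $p$ a power of two, involves only matrix products, so it can be checked in exact integer arithmetic. Below is a small sketch for one pair of real symmetric integer matrices (chosen arbitrarily by us, and deliberately not positive definite); it illustrates the statement and is of course no substitute for the proof.

```python
def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def matpow(x, p):
    """x raised to the non-negative integer power p by repeated multiplication."""
    n = len(x)
    out = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(p):
        out = matmul(out, x)
    return out

def trace(x):
    return sum(x[i][i] for i in range(len(x)))

# Two real symmetric (hence Hermitian) matrices with integer entries.
A = [[2, 1, 0], [1, -1, 3], [0, 3, 1]]
B = [[1, -2, 1], [-2, 0, 1], [1, 1, 4]]

results = {}
for p in (2, 4, 8):
    left = trace(matpow(matmul(A, B), p))                 # tr((AB)^p)
    right = trace(matmul(matpow(A, p), matpow(B, p)))     # tr(A^p B^p)
    results[p] = (left, right)
    assert left <= right
```

Because Python integers have arbitrary precision, there is no rounding error to worry about even though the entries of $A^8$ and $B^8$ are large.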

If we stop the iteration at an earlier point, then the same argument gives the inequality

$\|e^{A+B}\|_p \leq \|e^A e^B\|_p$

for $p = 2, 4, 8, \dots$ a power of two; one can view the original Golden-Thompson inequality as the $p = 1$ endpoint of this case in some sense^{13}. In the limit $p \to \infty$, we obtain in particular the operator norm inequality

(3.19) $\|e^{A+B}\|_{op} \leq \|e^A e^B\|_{op}.$

This inequality has a nice consequence: 

Corollary 3.2.2. Let $A, B$ be Hermitian matrices. If $e^A \leq e^B$ (i.e. $e^B - e^A$ is positive semi-definite), then $A \leq B$.

Proof. Since $e^A \leq e^B$, we have $\langle e^A x, x \rangle \leq \langle e^B x, x \rangle$ for all vectors $x$, or in other words $\|e^{A/2} x\| \leq \|e^{B/2} x\|$ for all $x$. This implies that $e^{A/2} e^{-B/2}$ is a contraction, i.e. $\|e^{A/2} e^{-B/2}\|_{op} \leq 1$. By (3.19), we conclude that $\|e^{(A-B)/2}\|_{op} \leq 1$, thus $(A - B)/2 \leq 0$, and the claim follows. □

Exercise 3.2.1. Reverse the above argument and conclude that Corollary 3.2.2 is in fact equivalent to (3.19).

It is remarkably tricky to try to prove Corollary 3.2.2 directly. Here is a somewhat messy proof. By the fundamental theorem of calculus, it suffices to show that whenever $A(t)$ is a Hermitian matrix depending smoothly on a real parameter with $\frac{d}{dt} e^{A(t)} \geq 0$, then $\frac{d}{dt} A(t) \geq 0$. Indeed, Corollary 3.2.2 follows from this claim by setting $A(t) := \log(e^A + t(e^B - e^A))$ and concluding that $A(1) \geq A(0)$.

^{12} See also [Ve2008] for a slight variant of this proof.

^{13} In fact, the Golden-Thompson inequality is true in any operator norm; see [Bh1997, Theorem 9.3.7].

To obtain this claim, we use the Duhamel formula

$\frac{d}{dt} e^{A(t)} = \int_0^1 e^{(1-s)A(t)} \left(\frac{d}{dt} A(t)\right) e^{sA(t)}\, ds.$

This formula can be proven by Taylor expansion, or by carefully approximating $e^{A(t)}$ by $(1 + A(t)/N)^N$; alternatively, one can integrate the identity

$\frac{\partial}{\partial s}\left(e^{-sA(t)} \frac{\partial}{\partial t} e^{sA(t)}\right) = e^{-sA(t)} \left(\frac{\partial}{\partial t} A(t)\right) e^{sA(t)}$

which follows from the product rule and by interchanging the $s$ and $t$ derivatives at a key juncture. We rearrange the Duhamel formula as

$\frac{d}{dt} e^{A(t)} = e^{A(t)/2} \left(\int_{-1/2}^{1/2} e^{sA(t)} \left(\frac{d}{dt} A(t)\right) e^{-sA(t)}\, ds\right) e^{A(t)/2}.$

Using the basic identity $e^A B e^{-A} = e^{\mathrm{ad}(A)} B$, we thus have

$\frac{d}{dt} e^{A(t)} = e^{A(t)/2} \left[\left(\int_{-1/2}^{1/2} e^{s\,\mathrm{ad}(A(t))}\, ds\right)\left(\frac{d}{dt} A(t)\right)\right] e^{A(t)/2};$

formally evaluating the integral, we obtain

$\frac{d}{dt} e^{A(t)} = e^{A(t)/2} \left[\frac{\sinh(\mathrm{ad}(A(t))/2)}{\mathrm{ad}(A(t))/2} \left(\frac{d}{dt} A(t)\right)\right] e^{A(t)/2}$

and thus

$\frac{d}{dt} A(t) = \frac{\mathrm{ad}(A(t))/2}{\sinh(\mathrm{ad}(A(t))/2)} \left(e^{-A(t)/2} \left(\frac{d}{dt} e^{A(t)}\right) e^{-A(t)/2}\right).$

As $\frac{d}{dt} e^{A(t)}$ was positive semi-definite by hypothesis, $e^{-A(t)/2} (\frac{d}{dt} e^{A(t)}) e^{-A(t)/2}$ is also. It thus suffices to show that for any Hermitian $A$, the operator $\frac{\mathrm{ad}(A)}{\sinh(\mathrm{ad}(A))}$ preserves the property of being semi-definite.

Note that for any real $\xi$, the operator $e^{2\pi i \xi\,\mathrm{ad}(A)}$ maps a positive semi-definite matrix $B$ to another positive semi-definite matrix, namely $e^{2\pi i \xi A} B e^{-2\pi i \xi A}$. By the Fourier inversion formula, it thus suffices to show that the kernel $F(x) := \frac{x}{\sinh(x)}$ is positive semi-definite in the sense that it has non-negative Fourier transform (this is a special case of Bochner's theorem). But a routine (but somewhat tedious) application of contour integration shows that the Fourier transform $\hat F(\xi) = \int_{\mathbb{R}} e^{-2\pi i x \xi} F(x)\, dx$ is given by the formula $\hat F(\xi) = \frac{\pi^2}{2\cosh^2(\pi^2\xi)}$, and the claim follows.
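
The positivity of the Fourier transform of $x/\sinh(x)$ is the crux of the argument above, and it can be checked numerically without any contour integration. The sketch below compares a Simpson-rule approximation of $\int_{\mathbb{R}} e^{-2\pi i x \xi} \frac{x}{\sinh x}\, dx$ with the closed form $\frac{\pi^2}{2\cosh^2(\pi^2\xi)}$; the truncation width, step count and sample frequencies are arbitrary choices of ours.

```python
import math

def F(x):
    """The kernel x / sinh(x), with the removable singularity at 0 filled in."""
    return 1.0 if abs(x) < 1e-8 else x / math.sinh(x)

def fourier_transform(xi, half_width=40.0, steps=40000):
    """Simpson approximation of the integral of exp(-2 pi i x xi) F(x) over R.

    F is even, so the transform is real and only the cosine part survives.
    The integrand is negligible outside [-half_width, half_width].
    """
    h = 2 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        x = -half_width + k * h
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        total += weight * F(x) * math.cos(2 * math.pi * x * xi)
    return total * h / 3

def closed_form(xi):
    return math.pi ** 2 / (2 * math.cosh(math.pi ** 2 * xi) ** 2)

for xi in (0.0, 0.05, 0.1, 0.2):
    assert abs(fourier_transform(xi) - closed_form(xi)) < 1e-6
```

At $\xi = 0$ this reduces to the classical integral $\int_{\mathbb{R}} \frac{x}{\sinh x}\, dx = \frac{\pi^2}{2}$; for the argument in the text, only the non-negativity of the transform matters, not the precise constant.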

Because of the Golden-Thompson inequality, many applications of the exponential moment method in commutative probability theory can be extended without difficulty to the non-commutative case, as was observed in [AhWi2002]. For instance, consider (a special case of) the Chernoff inequality

$\mathbf{P}(X_1 + \dots + X_N \geq \lambda\sigma) \leq \max(e^{-\lambda^2/4}, e^{-\lambda\sigma/2})$

for any $\lambda > 0$, where $X_1,\dots,X_N \equiv X$ are iid scalar random variables taking values in $[-1,1]$ of mean zero and with total variance $\sigma^2$ (i.e. each factor has variance $\sigma^2/N$). We briefly recall the standard proof of this inequality from Section 2.1. We first use Markov's inequality to obtain

$\mathbf{P}(X_1 + \dots + X_N \geq \lambda\sigma) \leq e^{-t\lambda\sigma} \mathbf{E} e^{t(X_1 + \dots + X_N)}$

for some parameter $t > 0$ to be optimised later. In the scalar case, we can factor $e^{t(X_1 + \dots + X_N)}$ as $e^{tX_1} \dots e^{tX_N}$ and then use the iid hypothesis to write the right-hand side as

$e^{-t\lambda\sigma} (\mathbf{E} e^{tX})^N.$

An elementary Taylor series computation then reveals the bound $\mathbf{E} e^{tX} \leq \exp(t^2\sigma^2/N)$ when $0 \leq t \leq 1$; inserting this bound and optimising in $t$ we obtain the claim.
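
For a concrete feel for the scalar bound, one can compare it with an exactly computable tail. The sketch below takes the $X_i$ to be iid random signs, uniform on $\{-1,+1\}$ (our choice of example; each has mean zero and variance 1, so $\sigma^2 = N$), computes $\mathbf{P}(X_1 + \dots + X_N \geq \lambda\sigma)$ from binomial coefficients, and checks it against $\max(e^{-\lambda^2/4}, e^{-\lambda\sigma/2})$.

```python
import math

def tail_probability(N, threshold):
    """P(X_1 + ... + X_N >= threshold) for iid signs X_i uniform on {-1, +1}."""
    # The sum equals 2*K - N, where K ~ Binomial(N, 1/2) counts the +1 outcomes.
    hits = sum(math.comb(N, k) for k in range(N + 1) if 2 * k - N >= threshold)
    return hits / 2 ** N

def chernoff_bound(lam, sigma):
    return max(math.exp(-lam ** 2 / 4), math.exp(-lam * sigma / 2))

N = 100
sigma = math.sqrt(N)  # each sign has variance 1, so the total variance is N
for lam in (0.5, 1.0, 2.0, 3.0, 5.0):
    assert tail_probability(N, lam * sigma) <= chernoff_bound(lam, sigma)
```

In this regime $\lambda \leq 2\sigma$, so the Gaussian-type term $e^{-\lambda^2/4}$ is the larger of the two; the second term only takes over for very large deviations.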

Now suppose that $X_1,\dots,X_N \equiv X$ are iid $d \times d$ Hermitian matrices. One can try to adapt the above method to control the size of the sum $X_1 + \dots + X_N$. The key point is then to bound expressions such as

$\mathbf{E}\, \mathrm{tr}\, e^{t(X_1 + \dots + X_N)}.$

As $X_1,\dots,X_N$ need not commute, we cannot separate the product completely. But by Golden-Thompson, we can bound this expression by

$\mathbf{E}\, \mathrm{tr}\, e^{t(X_1 + \dots + X_{N-1})} e^{tX_N}$

which by independence we can then factorise as

$\mathrm{tr}(\mathbf{E} e^{t(X_1 + \dots + X_{N-1})})(\mathbf{E} e^{tX_N}).$

As the matrices involved are positive definite, we can then take out the final factor in operator norm:

$\|\mathbf{E} e^{tX_N}\|_{op}\, \mathrm{tr}\, \mathbf{E} e^{t(X_1 + \dots + X_{N-1})}.$

Iterating this procedure, we can eventually obtain the bound

$\mathbf{E}\, \mathrm{tr}\, e^{t(X_1 + \dots + X_N)} \leq d\, \|\mathbf{E} e^{tX}\|_{op}^N$

(the factor of $d$ being the trace of the identity matrix that remains at the end of the iteration). Combining this with the rest of the Chernoff inequality argument, applied to both $X_1 + \dots + X_N$ and its negation, we can establish a matrix generalisation

$\mathbf{P}(\|X_1 + \dots + X_N\|_{op} \geq \lambda\sigma) \leq 2d \max(e^{-\lambda^2/4}, e^{-\lambda\sigma/2})$

of the Chernoff inequality, under the assumption that the $X_1,\dots,X_N$ are iid with mean zero, have operator norm bounded by 1, and have total variance $\sum_{i=1}^N \|\mathbf{E} X_i^2\|_{op}$ equal to $\sigma^2$; see for instance [Ve2008] for details.

Further discussion of the use of the Golden-Thompson inequality and its variants in non-commutative Chernoff-type inequalities can be found in [Gr2009], [Ve2008], [Tr2010]. It seems that this inequality may be quite useful in simplifying the proofs of several of the basic estimates in this subject.



3.3. The Dyson and Airy kernels of GUE via 
semiclassical analysis 

Let $n$ be a large integer, and let $M_n$ be the Gaussian Unitary Ensemble (GUE), i.e. the random Hermitian matrix with probability distribution

$C_n e^{-\mathrm{tr}(M_n^2)/2}\, dM_n$

where $dM_n$ is a Haar measure on Hermitian matrices and $C_n$ is the normalisation constant required to make the distribution of unit mass. The eigenvalues $\lambda_1 < \dots < \lambda_n$ of this matrix are then a coupled family of $n$ real random variables. For any $1 \leq k \leq n$, we can define the $k$-point correlation function $\rho_k(x_1,\dots,x_k)$ to be the unique symmetric measure on $\mathbb{R}^k$ such that

$\int_{\mathbb{R}^k} F(x_1,\dots,x_k) \rho_k(x_1,\dots,x_k) = \mathbf{E} \sum_{1 \leq i_1,\dots,i_k \leq n \text{ distinct}} F(\lambda_{i_1},\dots,\lambda_{i_k}).$

A standard computation (given for instance in Section 2.6) gives the Ginibre formula [Gi1965]

$\rho_n(x_1,\dots,x_n) = C'_n \left(\prod_{1 \leq i < j \leq n} |x_i - x_j|^2\right) e^{-\sum_{j=1}^n |x_j|^2/2}$

for the $n$-point correlation function, where $C'_n$ is another normalisation constant. Using Vandermonde determinants, one can rewrite this expression in determinantal form as

$\rho_n(x_1,\dots,x_n) = C''_n \det(K_n(x_i,x_j))_{1 \leq i,j \leq n}$

where the kernel $K_n$ is given by

$K_n(x,y) := \sum_{k=0}^{n-1} \phi_k(x) \phi_k(y)$

where $\phi_k(x) := P_k(x) e^{-x^2/4}$ and $P_0, P_1, \dots$ are the ($L^2$-normalised) Hermite polynomials (thus the $\phi_k$ are an orthonormal family, with each $P_k$ being a polynomial of degree $k$). Integrating out one or more of the variables, one is led to the Gaudin-Mehta formula^{14}

(3.20) $\rho_k(x_1,\dots,x_k) = \det(K_n(x_i,x_j))_{1 \leq i,j \leq k}.$

Again, see Section 2.6 for details.

The functions $\phi_k(x)$ can be viewed as an orthonormal basis of eigenfunctions for the harmonic oscillator operator

$L\phi := \left(-\frac{d^2}{dx^2} + \frac{x^2}{4}\right)\phi;$

indeed it is a classical fact that

$L\phi_k = (k + \frac{1}{2})\phi_k.$

As such, the kernel $K_n$ can be viewed as the integral kernel of the spectral projection operator $1_{(-\infty, n+\frac{1}{2}]}(L)$.
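
The eigenfunction relation $L\phi_k = (k + \frac{1}{2})\phi_k$ for $L = -\frac{d^2}{dx^2} + \frac{x^2}{4}$ reduces to a polynomial identity that can be verified exactly. Writing $\phi = P(x) e^{-x^2/4}$, a direct differentiation gives $L\phi = (-P'' + xP' + \frac{1}{2}P) e^{-x^2/4}$, so it suffices to check $-P'' + xP' = kP$. The sketch below does this with integer coefficients for the monic Hermite polynomials $He_k$, which we assume here to be constant multiples of the normalised $P_k$ (the normalisation does not affect the eigenvalue equation); the helper names are ours.

```python
def hermite(k):
    """Coefficients (lowest degree first) of the monic Hermite polynomial He_k,
    built from the recurrence He_{j+1}(x) = x He_j(x) - j He_{j-1}(x)."""
    prev, cur = [0], [1]  # He_{-1} := 0, He_0 = 1
    for j in range(k):
        shifted = [0] + cur                                # x * He_j
        padded = prev + [0] * (len(shifted) - len(prev))   # align lengths
        prev, cur = cur, [s - j * q for s, q in zip(shifted, padded)]
    return cur

def derivative(p):
    return [i * c for i, c in enumerate(p)][1:] or [0]

def oscillator_on_polynomial(p):
    """If phi = P(x) exp(-x^2/4), then L phi = (-P'' + x P' + P/2) exp(-x^2/4).

    Returns the coefficients of 2 * (-P'' + x P' + P/2), to stay in integers.
    """
    size = len(p)

    def pad(q):
        return q + [0] * (size - len(q))

    d1 = derivative(p)
    d2 = derivative(d1)
    x_d1 = [0] + d1
    return [2 * (-a + b) + c for a, b, c in zip(pad(d2), pad(x_d1), pad(p))]

# 2 * L phi_k should equal (2k + 1) * phi_k, i.e. the eigenvalue is k + 1/2.
for k in range(8):
    P = hermite(k)
    assert oscillator_on_polynomial(P) == [(2 * k + 1) * c for c in P]
```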

From (3.20) we see that the fine-scale structure of the eigenvalues of GUE is controlled by the asymptotics of $K_n$ as $n \to \infty$. The two main asymptotics of interest are given by the following lemmas:

^{14} In particular, the normalisation constant $C''_n$ in the previous formula turns out to simply be equal to 1.

Lemma 3.3.1 (Asymptotics of $K_n$ in the bulk). Let $x_0 \in (-2,2)$, and let $\rho_{sc}(x_0) := \frac{1}{2\pi}(4 - x_0^2)_+^{1/2}$ be the semicircular law density at $x_0$. Then, we have

(3.21) $K_n\left(x_0\sqrt{n} + \frac{y}{\sqrt{n}\rho_{sc}(x_0)}, x_0\sqrt{n} + \frac{z}{\sqrt{n}\rho_{sc}(x_0)}\right) \to \frac{\sin(\pi(y-z))}{\pi(y-z)}$

as $n \to \infty$ for any fixed $y, z \in \mathbb{R}$ (removing the singularity at $y = z$ in the usual manner).

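Lemma 3.3.2 (Asymptotics of $K_n$ at the edge). We have

(3.22) $K_n\left(2\sqrt{n} + \frac{y}{n^{1/6}}, 2\sqrt{n} + \frac{z}{n^{1/6}}\right) \to \frac{\mathrm{Ai}(y)\mathrm{Ai}'(z) - \mathrm{Ai}'(y)\mathrm{Ai}(z)}{y - z}$

as $n \to \infty$ for any fixed $y, z \in \mathbb{R}$, where $\mathrm{Ai}$ is the Airy function

$\mathrm{Ai}(x) := \frac{1}{\pi} \int_0^\infty \cos\left(\frac{t^3}{3} + tx\right) dt$

and again removing the singularity at $y = z$ in the usual manner.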

The proof of these asymptotics usually proceeds via computing the asymptotics of Hermite polynomials, together with the Christoffel-Darboux formula; this is for instance the approach taken in Section 2.6. However, there is a slightly different approach that is closer in spirit to the methods of semi-classical analysis. For sake of completeness, we will discuss this approach here, although to focus on the main ideas, the derivation will not be completely rigorous^{15}.

^{15} In particular, we will ignore issues such as convergence of integrals or of operators, or (removable) singularities in kernels caused by zeroes in the denominator. For a rigorous approach to these asymptotics in the discrete setting, see [Ol2008].

3.3.1. The bulk asymptotics. We begin with the bulk asymptotics, Lemma 3.3.1. Fix $x_0$ in the bulk region $(-2,2)$. Applying the change of variables

$x = x_0\sqrt{n} + \frac{y}{\sqrt{n}\rho_{sc}(x_0)}$

we see that the harmonic oscillator $L$ becomes

$-n\rho_{sc}(x_0)^2 \frac{d^2}{dy^2} + \frac{1}{4}\left(x_0\sqrt{n} + \frac{y}{\sqrt{n}\rho_{sc}(x_0)}\right)^2.$

Since $K_n$ is the integral kernel of the spectral projection to the region $L \leq n + \frac{1}{2}$, we conclude that the left-hand side of (3.21) (as a function of $y, z$) is the integral kernel of the spectral projection to the region

$-n\rho_{sc}(x_0)^2 \frac{d^2}{dy^2} + \frac{1}{4}\left(x_0\sqrt{n} + \frac{y}{\sqrt{n}\rho_{sc}(x_0)}\right)^2 \leq n + \frac{1}{2}.$

Isolating out the top order terms in $n$, we can rearrange this as

$-\frac{d^2}{dy^2} \leq \pi^2 + o(1).$

Thus, in the limit $n \to \infty$, we expect (heuristically, at least) the left-hand side of (3.21) to converge to the integral kernel of the spectral projection to the region

$-\frac{d^2}{dy^2} \leq \pi^2.$

Introducing the Fourier dual variable $\xi$ to $y$, as manifested by the Fourier transform

$\hat f(\xi) = \int_{\mathbb{R}} e^{-2\pi i \xi y} f(y)\, dy$

and its inverse

$\check F(y) = \int_{\mathbb{R}} e^{2\pi i \xi y} F(\xi)\, d\xi,$

then we (heuristically) have $\frac{d}{dy} = 2\pi i \xi$, and so we are now projecting to the region

(3.23) $|\xi|^2 \leq 1/4,$

i.e. we are restricting the Fourier variable to the interval $[-1/2,1/2]$. Back in physical space, the associated projection $P$ thus takes the form

$Pf(y) = \int_{[-1/2,1/2]} e^{2\pi i \xi y} \hat f(\xi)\, d\xi$

$= \int_{\mathbb{R}} \int_{[-1/2,1/2]} e^{2\pi i \xi y} e^{-2\pi i \xi z}\, d\xi\, f(z)\, dz$

$= \int_{\mathbb{R}} \frac{\sin(\pi(y-z))}{\pi(y-z)} f(z)\, dz$

and the claim follows.
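
The bulk asymptotics can also be observed numerically at moderate $n$. The sketch below evaluates $K_n$ from the orthonormal functions $\phi_k$ via the three-term recurrence $\sqrt{k+1}\,\phi_{k+1} = x\phi_k - \sqrt{k}\,\phi_{k-1}$ and compares the rescaled kernel with the sine kernel at $x_0 = 0$. Two assumptions of ours should be flagged: the kernel is treated as an integral kernel in the rescaled variables, so the Jacobian factor $\frac{1}{\sqrt{n}\rho_{sc}(x_0)}$ of the change of variables is included, and the tolerance and the value $n = 200$ are arbitrary.

```python
import math

def hermite_functions(x, n):
    """Values phi_0(x), ..., phi_{n-1}(x) of the orthonormal functions
    phi_k(x) = P_k(x) exp(-x^2/4), via the recurrence
    sqrt(k+1) phi_{k+1} = x phi_k - sqrt(k) phi_{k-1}."""
    values = [(2 * math.pi) ** -0.25 * math.exp(-x * x / 4)]
    if n > 1:
        values.append(x * values[0])
    for k in range(1, n - 1):
        values.append((x * values[k] - math.sqrt(k) * values[k - 1]) / math.sqrt(k + 1))
    return values[:n]

def kernel(x, y, n):
    """K_n(x, y) = sum over k < n of phi_k(x) phi_k(y)."""
    return sum(a * b for a, b in zip(hermite_functions(x, n), hermite_functions(y, n)))

def rescaled_kernel(y, z, n, x0=0.0):
    """K_n in the bulk scaling around x0 * sqrt(n), including the Jacobian factor."""
    rho = math.sqrt(max(4 - x0 * x0, 0.0)) / (2 * math.pi)
    scale = 1 / (math.sqrt(n) * rho)  # mean eigenvalue spacing near x0 * sqrt(n)
    base = x0 * math.sqrt(n)
    return scale * kernel(base + y * scale, base + z * scale, n)

def sine_kernel(y, z):
    t = math.pi * (y - z)
    return 1.0 if t == 0 else math.sin(t) / t

n = 200
for y, z in ((0.0, 0.0), (0.3, -0.4), (1.0, 0.25)):
    assert abs(rescaled_kernel(y, z, n) - sine_kernel(y, z)) < 0.02
```

The forward recurrence is numerically stable in the oscillatory region $|x| < 2\sqrt{n}$, which is all that is needed here.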

Remark 3.3.3. From a semiclassical perspective, the original spectral projection $L \leq n + \frac{1}{2}$ can be expressed in phase space (using the dual frequency variable $\eta$ to $x$) as the ellipse

(3.24) $4\pi^2\eta^2 + \frac{x^2}{4} \leq n + \frac{1}{2}$

which after the indicated change of variables becomes the elongated ellipse

$\xi^2 + \frac{x_0}{2n\rho_{sc}(x_0)(4 - x_0^2)} y + \frac{1}{4n^2\rho_{sc}(x_0)^2(4 - x_0^2)} y^2 \leq \frac{1}{4} + \frac{1}{2n(4 - x_0^2)}$

which converges (in some suitably weak sense) to the strip (3.23) as $n \to \infty$.



3.3.2. The edge asymptotics. A similar (heuristic) argument gives the edge asymptotics, Lemma 3.3.2. Starting with the change of variables

$x = 2\sqrt{n} + \frac{y}{n^{1/6}}$

the harmonic oscillator $L$ now becomes

$-n^{1/3} \frac{d^2}{dy^2} + \frac{1}{4}\left(2\sqrt{n} + \frac{y}{n^{1/6}}\right)^2.$

Thus, the left-hand side of (3.22) becomes the kernel of the spectral projection to the region

$-n^{1/3} \frac{d^2}{dy^2} + \frac{1}{4}\left(2\sqrt{n} + \frac{y}{n^{1/6}}\right)^2 \leq n + \frac{1}{2}.$

Expanding out, computing all terms of size $n^{1/3}$ or larger, and rearranging, this (heuristically) becomes

$-\frac{d^2}{dy^2} + y \leq o(1)$

and so, heuristically at least, we expect (3.22) to converge to the kernel of the projection to the region

(3.25) $-\frac{d^2}{dy^2} + y \leq 0.$

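To compute this, we again pass to the Fourier variable $\xi$, converting the above to

$4\pi^2\xi^2 + \frac{1}{2\pi i} \frac{d}{d\xi} \leq 0$

using the usual Fourier-analytic correspondences between multiplication and differentiation. If we then use the integrating factor transformation

$F(\xi) \mapsto e^{8\pi^3 i \xi^3/3} F(\xi)$

we can convert the above region to

$\frac{1}{2\pi i} \frac{d}{d\xi} \leq 0$

which on undoing the Fourier transformation becomes

$y \leq 0,$

and the spectral projection operation for this is simply the spatial multiplier $1_{(-\infty,0]}$. Thus, informally at least, we see that the spectral projection $P$ to the region (3.25) is given by the formula

$P = M^{-1} 1_{(-\infty,0]} M$

where the Fourier multiplier $M$ is given by the formula

$\widehat{Mf}(\xi) := e^{8\pi^3 i \xi^3/3} \hat f(\xi).$

In other words (ignoring issues about convergence of the integrals),

$Mf(y) = \int_{\mathbb{R}} \left(\int_{\mathbb{R}} e^{2\pi i y \xi} e^{8\pi^3 i \xi^3/3} e^{-2\pi i z \xi}\, d\xi\right) f(z)\, dz$

$= 2 \int_{\mathbb{R}} \left(\int_0^\infty \cos(2\pi(y-z)\xi + 8\pi^3\xi^3/3)\, d\xi\right) f(z)\, dz$

$= \frac{1}{\pi} \int_{\mathbb{R}} \left(\int_0^\infty \cos(t(y-z) + t^3/3)\, dt\right) f(z)\, dz$

$= \int_{\mathbb{R}} \mathrm{Ai}(y-z) f(z)\, dz$

and similarly

$M^{-1}f(z) = \int_{\mathbb{R}} \mathrm{Ai}(y-z) f(y)\, dy$

(this reflects the unitary nature of $M$). We thus see (formally, at least) that

$Pf(y) = \int_{\mathbb{R}} \left(\int_{(-\infty,0]} \mathrm{Ai}(y-w) \mathrm{Ai}(z-w)\, dw\right) f(z)\, dz.$

To simplify this expression we perform some computations closely related to the ones above. From the Fourier representation

$\mathrm{Ai}(y) = \frac{1}{\pi} \int_0^\infty \cos(ty + t^3/3)\, dt = \int_{\mathbb{R}} e^{2\pi i y \xi} e^{8\pi^3 i \xi^3/3}\, d\xi$

we see that

$\widehat{\mathrm{Ai}}(\xi) = e^{8\pi^3 i \xi^3/3}$

which means that

$\frac{1}{2\pi i} \frac{d}{d\xi} \widehat{\mathrm{Ai}}(\xi) = 4\pi^2\xi^2\, \widehat{\mathrm{Ai}}(\xi)$

and thus

$\left(-\frac{d^2}{dy^2} + y\right) \mathrm{Ai}(y) = 0,$

thus $\mathrm{Ai}$ obeys the Airy equation

$\mathrm{Ai}''(y) = y\, \mathrm{Ai}(y).$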
Using this, one soon computes that

$\frac{d}{dw} \frac{\mathrm{Ai}(y-w)\mathrm{Ai}'(z-w) - \mathrm{Ai}'(y-w)\mathrm{Ai}(z-w)}{y-z} = \mathrm{Ai}(y-w)\mathrm{Ai}(z-w).$

Also, stationary phase asymptotics tell us that $\mathrm{Ai}(y)$ decays exponentially fast as $y \to +\infty$, and hence $\mathrm{Ai}(y-w)$ decays exponentially fast as $w \to -\infty$ for fixed $y$; similarly for $\mathrm{Ai}'(z-w), \mathrm{Ai}'(y-w), \mathrm{Ai}(z-w)$. From the fundamental theorem of calculus, we conclude that

$\int_{(-\infty,0]} \mathrm{Ai}(y-w)\mathrm{Ai}(z-w)\, dw = \frac{\mathrm{Ai}(y)\mathrm{Ai}'(z) - \mathrm{Ai}'(y)\mathrm{Ai}(z)}{y-z}$

(this is a continuous analogue of the Christoffel-Darboux formula), and the claim follows.
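
The final identity, expressing the integral of $\mathrm{Ai}(y-w)\mathrm{Ai}(z-w)$ over $w \leq 0$ in Christoffel-Darboux form, lends itself to a direct numerical check. In the sketch below, $\mathrm{Ai}$ and $\mathrm{Ai}'$ are computed from the Maclaurin series of the two basic solutions of $u'' = xu$, using the standard values $\mathrm{Ai}(0) = 3^{-2/3}/\Gamma(2/3)$ and $\mathrm{Ai}'(0) = -3^{-1/3}/\Gamma(1/3)$; the series is only suitable for moderate arguments, and the truncation point, step count and test point are arbitrary choices of ours.

```python
import math

AI0 = 3 ** (-2 / 3) / math.gamma(2 / 3)     # Ai(0)
AIP0 = -3 ** (-1 / 3) / math.gamma(1 / 3)   # Ai'(0)

def airy(x, terms=80):
    """Return (Ai(x), Ai'(x)) from the Maclaurin series of the solutions f, g
    of u'' = x u with f(0) = 1, f'(0) = 0 and g(0) = 0, g'(0) = 1."""
    f = fp = g = gp = 0.0
    a, b = 1.0, x  # current series terms of f and g
    for k in range(terms):
        f += a
        g += b
        if x != 0.0:
            fp += 3 * k * a / x          # derivative of the x^(3k) term
            gp += (3 * k + 1) * b / x    # derivative of the x^(3k+1) term
        a *= x ** 3 / ((3 * k + 2) * (3 * k + 3))
        b *= x ** 3 / ((3 * k + 3) * (3 * k + 4))
    if x == 0.0:
        gp = 1.0
    return AI0 * f + AIP0 * g, AI0 * fp + AIP0 * gp

def airy_kernel_integral(y, z, upper=6.0, steps=1200):
    """Simpson approximation of the integral of Ai(y - w) Ai(z - w) over w <= 0,
    written with s = -w and truncated at s = upper (the integrand decays fast)."""
    h = upper / steps
    total = 0.0
    for k in range(steps + 1):
        s = k * h
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        total += weight * airy(y + s)[0] * airy(z + s)[0]
    return total * h / 3

def airy_kernel_closed_form(y, z):
    ay, apy = airy(y)
    az, apz = airy(z)
    return (ay * apz - apy * az) / (y - z)

y, z = 0.5, -0.3
assert abs(airy_kernel_integral(y, z) - airy_kernel_closed_form(y, z)) < 1e-6
```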

Remark 3.3.4. As in the bulk case, one can take a semi-classical analysis perspective and track what is going on in phase space. With the scaling we have selected, the ellipse (3.24) has become

$4\pi^2 n^{1/3} \xi^2 + \frac{(2\sqrt{n} + y/n^{1/6})^2}{4} \leq n + \frac{1}{2},$

which we can rearrange as the eccentric ellipse

$4\pi^2\xi^2 + y \leq \frac{1}{2n^{1/3}} - \frac{y^2}{4n^{2/3}}$

which is converging as $n \to \infty$ to the parabolic region

$4\pi^2\xi^2 + y \leq 0$

which can then be shifted to the half-plane $y \leq 0$ by the parabolic shear transformation $(y,\xi) \mapsto (y + 4\pi^2\xi^2, \xi)$, which is the canonical relation of the Fourier multiplier $M$. (The rapid decay of the kernel $\mathrm{Ai}$ of $M$ at $+\infty$ is then reflected in the fact that this transformation only shears to the right and not the left.)

Remark 3.3.5. Presumably one should also be able to apply the same heuristics to other invariant ensembles, such as those given by probability distributions of the form

$C_n e^{-\mathrm{tr}(P(M_n))}\, dM_n$

for some potential function $P$. Certainly one can soon get to an orthogonal polynomial formulation of the determinantal kernel for such ensembles, but I do not know if the projection operators for such kernels can be viewed as spectral projections to a phase space region as was the case for GUE. But if one could do this, this would provide a heuristic explanation as to the universality phenomenon for such ensembles, as Taylor expansion shows that all (reasonably smooth) regions of phase space converge to universal limits (such as a strip or paraboloid) after rescaling around either a non-critical point or a critical point of the region with the appropriate normalisation.



3.4. The mesoscopic structure of GUE 
eigenvalues 

In this section we give a heuristic model of the mesoscopic structure of the eigenvalues $\lambda_1 \leq \dots \leq \lambda_n$ of the $n \times n$ Gaussian Unitary Ensemble (GUE), where $n$ is a large integer. From Section 2.6, the probability density of these eigenvalues is given by the Ginibre distribution

$\frac{1}{Z_n} e^{-H(\lambda)}\, d\lambda$

where $d\lambda = d\lambda_1 \dots d\lambda_n$ is Lebesgue measure on the Weyl chamber $\{(\lambda_1,\dots,\lambda_n) \in \mathbb{R}^n : \lambda_1 \leq \dots \leq \lambda_n\}$, $Z_n$ is a constant, and the Hamiltonian $H$ is given by the formula

$H(\lambda_1,\dots,\lambda_n) := \sum_{j=1}^n \frac{\lambda_j^2}{2} - 2 \sum_{1 \leq i < j \leq n} \log|\lambda_i - \lambda_j|.$

As we saw in Section 2.4, at the macroscopic scale of $\sqrt{n}$, the eigenvalues $\lambda_j$ are distributed according to the Wigner semicircle law

$\rho_{sc}(x) := \frac{1}{2\pi} (4 - x^2)_+^{1/2}.$

Indeed, if one defines the classical location $\gamma_i^{cl}$ of the $i^{th}$ eigenvalue to be the unique solution in $[-2\sqrt{n}, 2\sqrt{n}]$ to the equation

$\int_{-2}^{\gamma_i^{cl}/\sqrt{n}} \rho_{sc}(x)\, dx = \frac{i}{n}$

then it is known that the random variable $\lambda_i$ is quite close to $\gamma_i^{cl}$. Indeed, a result of Gustavsson [Gu2005] shows that, in the bulk region when $\varepsilon n < i < (1-\varepsilon)n$ for some fixed $\varepsilon > 0$, $\lambda_i$ is distributed asymptotically as a Gaussian random variable with mean $\gamma_i^{cl}$ and standard deviation comparable to $\sqrt{\log n} \times \frac{1}{\sqrt{n}\rho_{sc}(\gamma_i^{cl}/\sqrt{n})}$. Note that from the semicircular law, the factor $\frac{1}{\sqrt{n}\rho_{sc}(\gamma_i^{cl}/\sqrt{n})}$ is the mean eigenvalue spacing.

At the other extreme, at the microscopic scale of the mean eigenvalue spacing (which is comparable to $1/\sqrt{n}$ in the bulk, but can be as large as $n^{-1/6}$ at the edge), the eigenvalues are asymptotically distributed with respect to a special determinantal point process, namely the Dyson sine process in the bulk (and the Airy process on the edge), as discussed in Section 3.3.

We now focus on the mesoscopic structure of the eigenvalues, which
involves scales that are intermediate between the microscopic scale
$1/\sqrt{n}$ and the macroscopic scale $\sqrt{n}$, for instance in
correlating the eigenvalues $\lambda_i$ and $\lambda_j$ in the regime
$|i - j| \sim n^\theta$ for some $0 < \theta < 1$. Here, there is a
surprising phenomenon: there is quite a long-range correlation between
such eigenvalues. The results from [Gu2005] show that both $\lambda_i$
and $\lambda_j$ behave asymptotically like Gaussian random variables,
but a further result from the same paper shows that the correlation
between these two random variables is asymptotic to $1 - \theta$ (in the
bulk, at least); thus, for instance, adjacent eigenvalues
$\lambda_{i+1}$ and $\lambda_i$ are almost perfectly correlated (which
makes sense, as their spacing is much less than either of their standard
deviations), while even very distant eigenvalues, such as
$\lambda_{n/4}$ and $\lambda_{3n/4}$, have a correlation comparable to
$1/\log n$. One way to get a sense of this is to look at the trace

$$\lambda_1 + \cdots + \lambda_n.$$

This is also the sum of the diagonal entries of a GUE matrix, and is
thus normally distributed with a variance of $n$. In contrast, each of
the $\lambda_i$ (in the bulk, at least) has a variance comparable to
$\log n / n$. In order for these two facts to be consistent, the average
correlation between pairs of eigenvalues then has to be of the order of
$1/\log n$.
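
The bookkeeping behind this consistency check can be written out as
follows. This is only a sketch under a simplifying assumption not made
in the text, namely that every eigenvalue has the same variance
$\sigma^2 \approx \log n / n$ (ignoring constants and the edge); the
symbol $\bar{c}$, introduced here for illustration, denotes the average
correlation over pairs $i \neq j$.

```latex
\begin{align*}
n = \operatorname{Var}\Big(\sum_{i=1}^n \lambda_i\Big)
  &= \sum_{i=1}^n \operatorname{Var}(\lambda_i)
     + \sum_{i \neq j} \operatorname{Cov}(\lambda_i, \lambda_j) \\
  &\approx n \sigma^2 + n(n-1)\, \bar{c}\, \sigma^2,
  \qquad \sigma^2 \approx \frac{\log n}{n}, \\
\bar{c} &\approx \frac{n - \log n}{(n-1) \log n} \sim \frac{1}{\log n}.
\end{align*}
```

If the eigenvalues were instead uncorrelated, the trace would have
variance only about $\log n$, far smaller than $n$.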

In this section we will give a heuristic way to see this correlation,
based on Taylor expansion of the convex Hamiltonian $H(\lambda)$ around
the minimum $\gamma$, which gives a conceptual probabilistic model for
the mesoscopic structure of the GUE eigenvalues. While this heuristic
is in no way rigorous, it does seem to explain many of the features
currently known or conjectured about GUE, and looks likely to extend
also to other models.

3.4.1. Fekete points. It is easy to see that the Hamiltonian
$H(\lambda)$ is convex in the Weyl chamber, and goes to infinity on the
boundary of this chamber, so it must have a unique minimum, at a set of
points $\gamma = (\gamma_1, \ldots, \gamma_n)$ known as the Fekete
points. At the minimum, we have $\nabla H(\gamma) = 0$, which expands to
become the set of conditions

$$\gamma_j - 2 \sum_{i \neq j} \frac{1}{\gamma_j - \gamma_i} = 0 \tag{3.26}$$

for all $1 \leq j \leq n$. To solve these conditions, we introduce the
monic degree $n$ polynomial

$$P(x) := \prod_{i=1}^n (x - \gamma_i).$$

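Differentiating this polynomial, we observe that

$$P'(x) = P(x) \sum_{i=1}^n \frac{1}{x - \gamma_i} \tag{3.27}$$

and

$$P''(x) = P(x) \sum_{1 \leq i, j \leq n : i \neq j} \frac{1}{x - \gamma_i} \frac{1}{x - \gamma_j}.$$

Using the identity

$$\frac{1}{x - \gamma_i} \frac{1}{x - \gamma_j} = \frac{1}{x - \gamma_i} \frac{1}{\gamma_i - \gamma_j} + \frac{1}{x - \gamma_j} \frac{1}{\gamma_j - \gamma_i}$$

followed by (3.26), we can rearrange this as

$$P''(x) = P(x) \sum_{i=1}^n \frac{\gamma_i}{x - \gamma_i}.$$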

Comparing this with (3.27), and writing
$\frac{\gamma_i}{x - \gamma_i} = \frac{x}{x - \gamma_i} - 1$, we
conclude that

$$P''(x) = x P'(x) - n P(x),$$

or in other words that $P$ is the $n^{th}$ Hermite polynomial

$$P(x) = H_n(x) := (-1)^n e^{x^2/2} \frac{d^n}{dx^n} e^{-x^2/2}.$$

Thus the Fekete points $\gamma_i$ are nothing more than the zeroes of
the $n^{th}$ Hermite polynomial.
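
This identification can be checked numerically. The sketch below
computes the zeroes of the $n^{th}$ Hermite polynomial as the
eigenvalues of its Jacobi matrix (a standard technique that is swapped
in here and is not taken from the text) and evaluates the left-hand
side of (3.26) at those points. The function names `fekete_points` and
`gradient_H` are illustrative.

```python
import numpy as np

def fekete_points(n):
    """Zeroes of the n-th (probabilists') Hermite polynomial, computed as the
    eigenvalues of the Jacobi matrix with zero diagonal and off-diagonal
    entries sqrt(1), ..., sqrt(n - 1)."""
    off = np.sqrt(np.arange(1, n))
    jacobi = np.diag(off, 1) + np.diag(off, -1)
    return np.linalg.eigvalsh(jacobi)

def gradient_H(lam):
    """Gradient of H: component j is lam_j - 2 * sum_{i != j} 1 / (lam_j - lam_i)."""
    diff = lam[:, None] - lam[None, :]
    np.fill_diagonal(diff, np.inf)      # removes the i = j terms, since 1/inf = 0
    return lam - 2.0 * np.sum(1.0 / diff, axis=1)

gamma = fekete_points(50)
residual = np.max(np.abs(gradient_H(gamma)))
print(residual)    # should be at the level of rounding error
```

For $n = 2$ the polynomial is $x^2 - 1$, with zeroes $\pm 1$, and one
can verify (3.26) by hand: $1 - 2 \cdot \frac{1}{1 - (-1)} = 0$.
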

Heuristically, one can study these zeroes by looking at the function

$$\phi(x) := P(x) e^{-x^2/4}$$

which solves the eigenfunction equation

$$\phi''(x) + \Big( n + \frac{1}{2} - \frac{x^2}{4} \Big) \phi(x) = 0,$$

in which the constant $\frac{1}{2}$ is negligible for large $n$.
Comparing this equation with the harmonic oscillator equation
$\phi''(x) + k^2 \phi(x) = 0$, which has plane wave solutions
$\phi(x) = A \cos(kx + \theta)$ for $k^2$ positive and exponentially
decaying solutions for $k^2$ negative, we are led (heuristically, at
least) to conclude that $\phi$ is concentrated in the region where
$n - \frac{x^2}{4}$ is positive (i.e. inside the interval
$[-2\sqrt{n}, 2\sqrt{n}]$) and will oscillate at frequency roughly
$\sqrt{n - \frac{x^2}{4}}$ inside this region. As such, we expect the
Fekete points $\gamma_i$ to obey the same spacing law as the classical
locations $\gamma_i^{cl}$; indeed it is possible to show that
$\gamma_i = \gamma_i^{cl} + O(1/\sqrt{n})$ in the bulk (with some
standard modifications at the edge). In particular, we have the
heuristic

$$\gamma_i - \gamma_j \approx (i - j)/\sqrt{n} \tag{3.28}$$

for $i, j$ in the bulk.
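
The heuristic (3.28) suppresses a bounded density factor: the spacing
law above suggests that the gap between consecutive Fekete points near a
point $x$ is about $1/(\sqrt{n}\, \rho_{sc}(x/\sqrt{n}))$, which equals
$\pi/\sqrt{n}$ at the centre. A quick numerical sketch, again computing
the Hermite zeroes through a Jacobi matrix (an assumption of this
example, with illustrative names):

```python
import numpy as np

def fekete_points(n):
    """Zeroes of the n-th Hermite polynomial via the Jacobi matrix."""
    off = np.sqrt(np.arange(1, n))
    return np.linalg.eigvalsh(np.diag(off, 1) + np.diag(off, -1))

n = 400
gamma = fekete_points(n)

# Central gap, measured against the predicted value pi / sqrt(n).
central_gap = gamma[n // 2] - gamma[n // 2 - 1]
ratio = central_gap * np.sqrt(n) / np.pi

# Gaps in the middle half of the spectrum, in units of 1 / sqrt(n).
bulk = gamma[n // 4 : 3 * n // 4]
scaled_gaps = np.diff(bulk) * np.sqrt(n)
print(ratio, scaled_gaps.min(), scaled_gaps.max())
```

The rescaled gaps stay within a bounded range across the bulk, which is
all that (3.28) is used for in what follows.
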

Remark 3.4.1. If one works with the circular unitary ensemble
(CUE) instead of the GUE, in which $M_n$ is drawn from the unitary
$n \times n$ matrices using Haar measure, the Fekete points become
equally spaced around the unit circle, so that this heuristic
essentially becomes exact.

3.4.2. Taylor expansion. Now we expand around the Fekete points
by making the ansatz

$$\lambda_i = \gamma_i + x_i;$$

thus the results of [Gu2005] predict that each $x_i$ is normally
distributed with standard deviation $O(\sqrt{\log n}/\sqrt{n})$ (in the
bulk). We Taylor expand

$$H(\lambda) = H(\gamma) + \nabla H(\gamma)(x) + \frac{1}{2} \nabla^2 H(\gamma)(x, x) + \ldots.$$

We heuristically drop the cubic and higher order terms. The constant
term $H(\gamma)$ can be absorbed into the partition constant $Z_n$,
while the linear term vanishes by the property $\nabla H(\gamma) = 0$ of
the Fekete points. We are thus led to a quadratic (i.e. Gaussian) model

$$\frac{1}{Z'_n} e^{-\frac{1}{2} \nabla^2 H(\gamma)(x, x)}\, dx$$

for the probability distribution of the shifts $x_i$, where $Z'_n$ is
the appropriate normalisation constant.

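Direct computation allows us to expand the quadratic form
$\frac{1}{2} \nabla^2 H(\gamma)$ as

$$\frac{1}{2} \nabla^2 H(\gamma)(x, x) = \sum_{j=1}^n \frac{x_j^2}{2} + \sum_{1 \leq i < j \leq n} \frac{(x_i - x_j)^2}{(\gamma_i - \gamma_j)^2}.$$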
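
For moderate $n$ the covariance matrix of this Gaussian model can be
obtained directly by inverting the Hessian. The sketch below is
illustrative only: the Hessian has diagonal entries
$1 + 2 \sum_{i \neq j} (\gamma_j - \gamma_i)^{-2}$ and off-diagonal
entries $-2 (\gamma_i - \gamma_j)^{-2}$, the Fekete points are computed
as eigenvalues of a Jacobi matrix (an assumption of this example), and
the function names are hypothetical.

```python
import numpy as np

def fekete_points(n):
    """Zeroes of the n-th Hermite polynomial via the Jacobi matrix."""
    off = np.sqrt(np.arange(1, n))
    return np.linalg.eigvalsh(np.diag(off, 1) + np.diag(off, -1))

def hessian_H(gamma):
    """Hessian of H at gamma: identity plus a weighted graph Laplacian."""
    diff = gamma[:, None] - gamma[None, :]
    np.fill_diagonal(diff, np.inf)          # no self-interaction
    w = 2.0 / diff**2                       # weights 2 / (gamma_i - gamma_j)^2
    return np.eye(len(gamma)) + np.diag(w.sum(axis=1)) - w

n = 256
A = hessian_H(fekete_points(n))
cov = np.linalg.inv(A)                      # covariance of the shifts x_i

trace_variance = cov.sum()                  # Var(x_1 + ... + x_n)
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)
print(trace_variance, n * cov[n // 2, n // 2], corr[n // 4, 3 * n // 4])
```

Since the Hessian maps the all-ones vector to itself, the model gives
the sum $x_1 + \cdots + x_n$ variance exactly $n$, matching the variance
of the trace. The second printed quantity can be compared with
$\log n$, and the third is the model's correlation between
$\lambda_{n/4}$ and $\lambda_{3n/4}$.
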

The Taylor expansion is not particularly accurate when $j$ and $i$ are
too close, say $j = i + O(\log^{O(1)} n)$, but we will ignore this issue
as it should only affect the microscopic behaviour rather than the
mesoscopic behaviour. This models the $x_i$ as (coupled) Gaussian random
variables whose covariance matrix can in principle be explicitly
computed by inverting the matrix of the quadratic form. Rather than
doing this precisely, we shall work heuristically (and somewhat
inaccurately) by re-expressing the quadratic form in the Haar basis. For
simplicity, let us assume that $n$ is a power of 2. Then the Haar basis
consists of the basis vector

$$\psi_0 := \frac{1}{\sqrt{n}} (1, \ldots, 1)$$

together with the basis vectors

$$\psi_I := \frac{1}{\sqrt{|I|}} (1_{I_l} - 1_{I_r})$$

for every discrete dyadic interval $I \subset \{1, \ldots, n\}$ of
length between 2 and $n$, where $I_l$ and $I_r$ are the left and right
halves of $I$, and $1_{I_l}, 1_{I_r} \in \mathbb{R}^n$ are the vectors
that are one on $I_l, I_r$ respectively and zero elsewhere. These form
an orthonormal basis of $\mathbb{R}^n$, thus we can write

$$x = \xi_0 \psi_0 + \sum_I \xi_I \psi_I$$

for some coefficients $\xi_0, \xi_I$.
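
A minimal sketch of this discrete Haar basis, with the basis vectors
stored as the rows of a matrix; the function name `haar_basis` and the
ordering of the rows (coarsest intervals first) are choices made here
for illustration.

```python
import numpy as np

def haar_basis(n):
    """Rows: psi_0, then psi_I for dyadic intervals I of length n, n/2, ..., 2."""
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of 2"
    rows = [np.full(n, 1.0 / np.sqrt(n))]           # psi_0
    length = n
    while length >= 2:
        half = length // 2
        for start in range(0, n, length):
            psi = np.zeros(n)
            psi[start : start + half] = 1.0            # left half I_l
            psi[start + half : start + length] = -1.0  # right half I_r
            rows.append(psi / np.sqrt(length))
        length = half
    return np.array(rows)

B = haar_basis(8)
x = np.arange(8.0)
xi = B @ x          # coefficients xi_0, xi_I of x in the Haar basis
x_back = B.T @ xi   # reconstruction of x from its coefficients
```

There are $1 + (1 + 2 + \cdots + n/2) = n$ rows, and orthonormality
means that `B @ B.T` is the identity, so `x_back` agrees with `x`.
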






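From orthonormality we have

$$\sum_{j=1}^n \frac{x_j^2}{2} = \frac{\xi_0^2}{2} + \sum_I \frac{\xi_I^2}{2}$$

and we have

$$\sum_{1 \leq i < j \leq n} \frac{(x_i - x_j)^2}{(\gamma_i - \gamma_j)^2} = \sum_{I, J} \xi_I \xi_J c_{I,J}$$

where the matrix coefficients $c_{I,J}$ are given by

$$c_{I,J} := \sum_{1 \leq i < j \leq n} \frac{(\psi_I(i) - \psi_I(j)) (\psi_J(i) - \psi_J(j))}{(\gamma_i - \gamma_j)^2}.$$

A standard heuristic wavelet computation using (3.28) suggests that
$c_{I,J}$ is small unless $I$ and $J$ are actually equal, in which case
one has

$$c_{I,I} \sim \frac{n}{|I|}$$

(in the bulk, at least). Actually, the decay of the $c_{I,J}$ away from
the diagonal $I = J$ is not so large, because the Haar wavelets
$\psi_I$ have poor moment and regularity properties. But one could in
principle use much smoother and much more balanced wavelets, in which
case the decay should be much faster.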

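This suggests that the GUE distribution could be modeled by the
distribution

$$\frac{1}{Z'_n} e^{-\xi_0^2/2} e^{-C \sum_I \frac{n}{|I|} \xi_I^2}\, d\xi \tag{3.29}$$

for some absolute constant $C$; thus we may model $\xi_0 \equiv N(0, 1)$
and $\xi_I \equiv C' \frac{\sqrt{|I|}}{\sqrt{n}} g_I$ for some iid
Gaussians $g_I \equiv N(0, 1)$ independent of $\xi_0$, where
$C' := (2C)^{-1/2}$. We then have as a model

$$x_i = \frac{\xi_0}{\sqrt{n}} + \frac{C'}{\sqrt{n}} \sum_I (1_{I_l}(i) - 1_{I_r}(i)) g_I$$

for the fluctuations of the eigenvalues (in the bulk, at least), leading
of course to the model

$$\lambda_i = \gamma_i + \frac{\xi_0}{\sqrt{n}} + \frac{C'}{\sqrt{n}} \sum_I (1_{I_l}(i) - 1_{I_r}(i)) g_I \tag{3.30}$$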

for the eigenvalues themselves. This model does not capture the
microscopic behaviour of the eigenvalues such as the sine kernel
(indeed, as noted before, the contribution of the very short $I$ (which
corresponds to very small values of $|j - i|$) is inaccurate), but
appears to be a good model to describe the mesoscopic behaviour. For
instance, observe that for each $i$ there are $\sim \log n$ independent
normalised Gaussians in the above sum, and so this model is consistent
with the result of Gustavsson that each $\lambda_i$ is Gaussian with
standard deviation $\sim \frac{\sqrt{\log n}}{\sqrt{n}}$. Also, if
$|i - j| \sim n^\theta$, then the expansions (3.30) of
$\lambda_i, \lambda_j$ share about $(1 - \theta) \log n$ of the
$\log n$ terms in the sum in common, which is consistent with the
further result of Gustavsson that the correlation between such
eigenvalues is comparable to $1 - \theta$.
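
The counting of shared terms can be made concrete by computing the
covariance matrix of the model exactly. In the sketch below the
constant $C'$ is set to 1 purely for illustration (its true value is
not specified in the text), indices are 0-based, and the function names
are hypothetical.

```python
import numpy as np

def haar_signs(n):
    """Columns: the sign patterns 1_{I_l} - 1_{I_r} for dyadic I of length 2..n."""
    cols = []
    length = n
    while length >= 2:
        half = length // 2
        for start in range(0, n, length):
            s = np.zeros(n)
            s[start : start + half] = 1.0
            s[start + half : start + length] = -1.0
            cols.append(s)
        length = half
    return np.array(cols).T                 # shape (n, n - 1)

def model_covariance(n, c_prime=1.0):
    """Covariance of x_i = xi_0 / sqrt(n) + (C' / sqrt(n)) sum_I s_I(i) g_I."""
    S = haar_signs(n)
    return (np.ones((n, n)) + c_prime**2 * (S @ S.T)) / n

n = 256
cov = model_covariance(n)
print(n * cov[0, 0], n * cov[0, 3], cov.sum())
```

Each index lies in exactly $\log_2 n$ dyadic intervals, so
$n \operatorname{Var}(x_i) = 1 + C'^2 \log_2 n$. Two indices whose
smallest common dyadic interval has length $2^k$ share
$\log_2 n - k + 1$ intervals, with matching signs in all but the
smallest one, which is the discrete counterpart of sharing about
$(1 - \theta) \log n$ terms. Summing all entries of the covariance
recovers the variance $n$ of the trace, since each sign pattern sums to
zero.
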

If one looks at the gap $\lambda_{i+1} - \lambda_i$ using (3.30) (and
replacing the Haar cutoff $1_{I_l}(i) - 1_{I_r}(i)$ by something
smoother for the purposes of computing the gap), one is led to a
heuristic of the form

$$\lambda_{i+1} - \lambda_i = \frac{1}{\rho_{sc}(\gamma_i/\sqrt{n})} \frac{1}{\sqrt{n}} + \frac{C'}{\sqrt{n}} \sum_I (1_{I_l}(i) - 1_{I_r}(i)) \frac{g_I}{|I|}.$$

The dominant terms here are the first term and the contribution of the
very short intervals $I$. At present, this model cannot be accurate,
because it predicts that the gap can sometimes be negative; the
contribution of the very short intervals must instead be replaced by
some other model that gives sine process behaviour, but we do not know
of an easy way to set up a plausible such model.

On the other hand, the model suggests that the gaps are largely
decoupled from each other, and have Gaussian tails. Standard heuristics
then suggest that of the $\sim n$ gaps in the bulk, the largest one
should be comparable to $\sqrt{\frac{\log n}{n}}$, which was indeed
established recently in [BeBo2010].
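
The standard heuristic here is an extreme-value estimate, which can be
sketched as follows. It assumes (as the model suggests, and only
heuristically) that the roughly $n$ bulk gaps are independent, fluctuate
at the scale $1/\sqrt{n}$ of the mean spacing, and have Gaussian tails;
the constant $c > 0$ below is hypothetical and is introduced only for
this illustration.

```latex
\begin{align*}
\mathbf{P}(\lambda_{i+1} - \lambda_i \geq t) &\approx \exp(-c\, n t^2), \\
\mathbf{E}\, \#\{ i : \lambda_{i+1} - \lambda_i \geq t \} &\approx n \exp(-c\, n t^2), \\
n \exp(-c\, n t^2) \approx 1
  &\iff t \approx \sqrt{\frac{\log n}{c\, n}} .
\end{align*}
```

The largest gap is thus expected at the threshold $t$ where the expected
number of exceedances drops to order one.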

Given any probability measure $\mu = \rho\, dx$ on $\mathbb{R}^n$ (or on
the Weyl chamber) with a smooth nonzero density, one can create an
associated heat flow on other smooth probability measures $f\, dx$ by
performing gradient flow with respect to the Dirichlet form

$$D(f\, dx) := \frac{1}{2} \int_{\mathbb{R}^n} \Big| \nabla \frac{f}{\rho} \Big|^2\, d\mu.$$

Using the ansatz (3.29), this flow decouples into a system of
independent Ornstein-Uhlenbeck processes

$$d\xi_0 = -\xi_0\, dt + dW_0$$

and

$$dg_I = C'' \frac{n}{|I|} (-g_I\, dt + dW_I)$$

where $dW_0, dW_I$ are independent Wiener processes (i.e. Brownian
motion). This is a toy model for the Dyson Brownian motion (see
Section 3.1). In this model, we see that the mixing time for each
$g_I$ is $O(|I|/n)$; thus, the large-scale variables ($g_I$ for large
$I$) evolve very slowly by Dyson Brownian motion, taking as long as
$O(1)$ to reach equilibrium, while the fine scale modes ($g_I$ for
small $I$) can achieve equilibrium in as brief a time as $O(1/n)$, with
the intermediate modes taking an intermediate amount of time to reach
equilibrium. It is precisely this picture that underlies the
Erdős-Schlein-Yau approach [ErScYa2009] to universality for Wigner
matrices via the local equilibrium flow, in which the measure (3.29) is
given an additional (artificial) weight, roughly of the shape
$e^{-n^{1-\epsilon} (\xi_0^2 + \sum_I \xi_I^2)}$, in order to make
equilibrium achieved globally in just time $O(n^{-1+\epsilon})$, leading
to a local log-Sobolev type inequality that ensures convergence of the
local statistics once one controls a Dirichlet form connected to the
local equilibrium measure; and then one can use the localisation of
eigenvalues provided by a local semicircle law to control that
Dirichlet form in turn for measures that have undergone Dyson Brownian
motion.
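
The separation of time scales in this toy model can be tabulated
directly. The sketch below sets the unspecified constant $C''$ to 1 for
illustration and uses only the drift term of the Ornstein-Uhlenbeck
equation, which makes the mean of $g_I$ decay like
$e^{-C'' (n/|I|) t}$; the function names are hypothetical.

```python
import numpy as np

def relaxation_times(n, c2=1.0):
    """Time scale |I| / (C'' n) for each dyadic length |I| = 2, 4, ..., n.

    n is assumed to be a Python int that is a power of 2; c2 stands in for
    the constant C'' of the text, whose value is not specified there.
    """
    lengths = 2 ** np.arange(1, n.bit_length())
    return lengths, lengths / (c2 * n)

def ou_mean(g0, rate, t):
    """Mean at time t of dg = rate * (-g dt + dW) started at g0.

    The noise term has mean zero, so only the drift contributes."""
    return g0 * np.exp(-rate * t)

n = 1024
lengths, times = relaxation_times(n)
for length, t in zip(lengths, times):
    # After one relaxation time the mean has decayed by a factor of e.
    print(length, t, ou_mean(1.0, n / length, t))
```

The finest modes ($|I| = 2$) relax on a time scale of order $1/n$ and
the coarsest mode ($|I| = n$) on a time scale of order 1, with the
intermediate scales interpolating geometrically between the two.
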